
steineggerlab / metabuli


Metabuli: specific and sensitive metagenomic classification via joint analysis of DNA and amino acid.

License: GNU General Public License v3.0

CMake 0.23% C++ 72.23% C 25.94% Shell 0.63% Python 0.38% Dockerfile 0.01% Perl 0.01% Makefile 0.21% Batchfile 0.01% Meson 0.06% Lua 0.01% Starlark 0.02% HTML 0.20% Roff 0.05% R 0.01%
bioinformatics k-mer metagenomics taxonomic-classification taxonomy

metabuli's People

Contributors

jaebeom-kim, martin-steinegger, milot-mirdita, sjaenick


metabuli's Issues

Metabuli mvdb runs forever

Dear metabuli people,

I wanted to use metabuli and so I followed the steps and try to download refseq217 using this command line:

metabuli databases RefSeq217 databases/metabuli/refseq217 tmp

My only problem is that it has been more than 160 hours and Metabuli seems stuck on

metabuli mvdb tmp/16670562538200881729/refseq_release217+human database/metabuli/refseq217

Is that normal, or is there a problem I don't know about?

After checking the end of the log, the last things printed are:

2023-07-25 19:42:49 (49.6 MB/s) - ‘tmp/15725239404589206304/refseq_release217+human.tar.gz’ saved [329036572382/329036572382]

refseq_release217+human/
refseq_release217+human/info
refseq_release217+human/taxID_list
refseq_release217+human/diffIdx
refseq_release217+human/split
refseq_release217+human/acc2taxid.map
refseq_release217+human/taxonomy/
refseq_release217+human/taxonomy/fullnamelineage.dmp
refseq_release217+human/taxonomy/rankedlineage.dmp
refseq_release217+human/taxonomy/gencode.dmp
refseq_release217+human/taxonomy/typeoftype.dmp
refseq_release217+human/taxonomy/division.dmp
refseq_release217+human/taxonomy/nodes.dmp
refseq_release217+human/taxonomy/merged.dmp
refseq_release217+human/taxonomy/host.dmp
refseq_release217+human/taxonomy/names.dmp
refseq_release217+human/taxonomy/delnodes.dmp
refseq_release217+human/taxonomy/citations.dmp
refseq_release217+human/taxonomy/taxidlineage.dmp
refseq_release217+human/taxonomy/typematerial.dmp
refseq_release217+human/taxonomy/excludedfromtype.dmp
Terminated
Error: mv died

But metabuli mvdb is still active as a process.

Best
Remi
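
A hedged sanity check (a sketch, not a diagnosis): the downloaded tarball alone is ~329 GB, so the tmp directory and the target directory together need roughly twice that much free space for extraction plus the final move. "Terminated" can also come from an external kill (e.g. a scheduler limit or the OOM killer), which a disk check cannot detect. The paths below are the ones from the command above.

import shutil

# Report free space on the filesystems holding the tmp dir and the DB target dir.
for path in ("tmp", "databases/metabuli/refseq217"):
    usage = shutil.disk_usage(path)
    print(path, "free:", round(usage.free / 10**9), "GB")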

Species missing from Metabuli, but detected with Kraken 1/2

Hi @jaebeom-kim,

I'm attaching a FASTQ file with sequences from a public dataset; with Kraken 1 or Kraken 2, these reads are classified as 'Kovacikia minuta' at the species level. The dataset is from a microbial mat obtained from corals, and these mats are rich in Cyanobacteria, so the assignment to Kovacikia seems to make sense. (With both Kraken versions, Kovacikia is actually the dominant genus in this dataset.)

Using Metabuli with the RefSeq217 database, I don't get any Kovacikia hits at all, even though, according to the database-report workflow, this genome is present in the database. Instead, very few reads are classified below the 'class' rank, and those are assigned to different Cyanobacteriota.

Can you take a look?

Thanks,

  • Sebastian

Kovacikia.fq.txt

Help understanding AA and DNA identity score calculations

Dear Metabuli Developers,

To test the sensitivity of Metabuli, I recently tried to match DNA-seq reads from a Virus_B (not present in the RefSeq database) against a closely related Virus_A (taxid: 552509) to see whether Metabuli could discover new, closely related viruses in metagenomic samples.

So I took the reads from Virus_B (files called VirusB_1.fastq and VirusB_2.fastq) and used as the database your pre-built refseq_virus database, in which the proteins of Virus_A are present.

Here is the command used:

metabuli classify VirusB_1.fastq VirusB_2.fastq RefSeq_virus_db Virus_genome --threads 20

In total, I successfully got 7074 hits.

>>> tab
      Classified                                 Read_ID  Tax_id  Read_length  DNA_identity  AA_identity  Hamming_distance Classification_Rank List_taxID:k-mer_count
0              1   A00553:76:H5N52DSX2:2:1101:6497:10927  552509          294      0.117347     0.153061                 7             species              552509:8
1              1  A00553:76:H5N52DSX2:2:1101:14226:11162  552509          294      0.117347     0.153061                 7             species              552509:8
2              1  A00553:76:H5N52DSX2:2:1101:30969:16141  552509          294      0.107143     0.142857                 7             species              552509:7
3              1  A00553:76:H5N52DSX2:2:1101:22634:19711  552509          294      0.117347     0.153061                 7             species              552509:8
4              1   A00553:76:H5N52DSX2:2:1101:8603:30452  552509          294      0.102041     0.142857                 8             species              552509:7
...          ...                                     ...     ...          ...           ...          ...               ...                 ...                    ...
7069           1   A00553:76:H5N52DSX2:2:2678:7627:14982  552509          294      0.086735     0.112245                 5             species              552509:4
7070           1  A00553:76:H5N52DSX2:2:2678:24189:20212  552509          294      0.091837     0.122449                 6             species              552509:5
7071           1  A00553:76:H5N52DSX2:2:2678:32325:22060  552509          294      0.181973     0.244898                13             species             552509:10
7072           1  A00553:76:H5N52DSX2:2:2678:14299:28823  552509          294      0.117347     0.142857                 5             species              552509:7
7073           1  A00553:76:H5N52DSX2:2:2678:17969:33896  552509          294      0.102041     0.142857                 8             species              552509:7

[7074 rows x 9 columns]

So it means that Metabuli can successfully find closely related viruses in metagenomic data, which other software such as Kraken2 cannot, if I understood correctly (please correct me if I'm wrong). However, the AA and DNA identity scores really surprised me (mean DNA identity = 0.106 and mean AA identity = 0.136 over the 7074 hits). This is far lower than expected between Virus_A and Virus_B, where the average AA identity is 0.391 when you run blastp of the Virus_A vs Virus_B genomes.

I don't know if I am missing something or if this is expected. Could you please help me understand how the identity scores are calculated?

All the best
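
For reference, the per-read averages quoted above can be recomputed from the classification report with a short pandas snippet. This is only a sketch for reproducing the means, not Metabuli's identity formula; it assumes the tab-separated column order shown in the excerpt, no header row, and a placeholder file name.

import pandas as pd

cols = ["Classified", "Read_ID", "Tax_id", "Read_length", "DNA_identity",
        "AA_identity", "Hamming_distance", "Classification_Rank",
        "List_taxID:k-mer_count"]
tab = pd.read_csv("JobID_classifications.tsv", sep="\t", header=None, names=cols)

hits = tab[tab["Classified"] == 1]          # keep only classified reads
print(len(hits), "classified reads")
print("mean DNA identity:", hits["DNA_identity"].mean())
print("mean AA identity:", hits["AA_identity"].mean())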

Will compressed fastq.gz files be available as input in the future?

Hello, thank you for this tool; it seems to be a very good alternative to Kraken2, for example.

I have a question about the input Metabuli accepts. At the moment, users can provide FASTA or FASTQ files.

I was wondering whether, in future releases, you would consider adding the ability to read compressed fastq.gz files directly, as this would really save space for projects that analyse a large number of raw reads.

All the best.
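
In the meantime, a minimal workaround sketch (file names are placeholders): decompress to plain FASTQ before calling metabuli, and delete the temporary files afterwards.

import gzip, shutil

# Writes sample_R1.fastq / sample_R2.fastq next to the compressed inputs.
for gz in ("sample_R1.fastq.gz", "sample_R2.fastq.gz"):
    with gzip.open(gz, "rb") as fin, open(gz[:-3], "wb") as fout:
        shutil.copyfileobj(fin, fout)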

Bus error (core dumped)

Hi - I recently installed Metabuli (great tool -- thank you) with conda and got the bus error in the title on two attempts, one on Illumina paired-end reads and the other on Nanopore reads. Have you seen this before (I have not)?
Thanks for your help.

Memory unit consistency

Hi there,

This is a really small thing, but an easy fix. The help message and the documentation here on GitHub give the memory unit as GiB (which already led me to an out-of-memory error because I missed the 'i' in GiB), but the log files report RAM usage in GB. I suppose what the log says doesn't matter too much if the user knows they're working in gibibytes, but I thought it worth pointing out anyway.

Cheers,
Sean
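
For anyone else caught by this, the arithmetic behind the mix-up is simply that a gibibyte (GiB) is 2^30 bytes while a gigabyte (GB) is 10^9 bytes, so for example --max-ram 64 means about 68.7 GB:

GIB = 2 ** 30   # gibibyte
GB = 10 ** 9    # gigabyte

print(64 * GIB)        # 68719476736 bytes
print(64 * GIB / GB)   # ~68.72 GB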

Add the full taxonomic classification to JobID_classifications.tsv

Hello, thank you for developing Metabuli, it has worked great with my test datasets!

One minor suggestion is to include, or have an option to include, the full taxonomy in JobID_classifications.tsv or a similar report.

For example, if a read or contig gets classified down to species level, such as Escherichia coli using GTDB pre-built database, it would be great to have the full d__Bacteria; p__Pseudomonadota; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia; s__Escherichia coli output automatically included in the report.

This feature would make Metabuli's output much more user-friendly and interpretable, especially for people who don't want to spend time sifting through taxID numbers.

Cheers!
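
Until such an option exists, here is a hedged post-processing sketch: it builds taxID-to-lineage strings from the NCBI-style taxdump files shipped in DBDIR/taxonomy (nodes.dmp, names.dmp) and appends a lineage column to JobID_classifications.tsv. The paths and the assumption that the taxID sits in the third column are placeholders based on the report excerpts elsewhere on this page.

import csv

def load_taxdump(taxonomy_dir):
    parent, rank, name = {}, {}, {}
    with open(taxonomy_dir + "/nodes.dmp") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|")]
            parent[fields[0]], rank[fields[0]] = fields[1], fields[2]
    with open(taxonomy_dir + "/names.dmp") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|")]
            if fields[3] == "scientific name":
                name[fields[0]] = fields[1]
    return parent, rank, name

def lineage(taxid, parent, rank, name):
    chain = []
    while taxid in parent and taxid != "1":      # stop at the root node
        chain.append(rank[taxid] + ":" + name.get(taxid, taxid))
        taxid = parent[taxid]
    return "; ".join(reversed(chain))

parent, rank, name = load_taxdump("DBDIR/taxonomy")          # placeholder DB path
with open("JobID_classifications.tsv") as fin, open("with_lineage.tsv", "w") as fout:
    for row in csv.reader(fin, delimiter="\t"):
        # assumed: third column is the assigned taxID
        fout.write("\t".join(row + [lineage(row[2], parent, rank, name)]) + "\n")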

Random Bus error during database building

Dear authors,

Thank you for your work; I'm glad I found this program. Based on the runs I've done so far, I've gotten pretty reliable results with the pre-built databases. My problem is that I want to build a custom database that includes all archaeal, bacterial and fungal reference genomes (23,547 genomes in total). Unfortunately, despite many attempts, I was unable to create the database: after executing the build command, the runs stopped at completely random stages with a Bus error message (the total size of the generated files ranged from 5 to 220 GB). During database building I followed the description on your site and used the provided scripts, and I was able to complete those steps without any problems. While the build command was running, I monitored my computer's resources and did not see any storage or free-memory issues. I tried running the software on several computers, under Ubuntu 23.10 and in a Microsoft WSL2 environment, and in each case I tried both the precompiled version and a version I compiled myself. Can you help me solve this issue?

Thanks,
Sandor

Low complexity regions in query reads.

Metabuli allocates memory for storing matches between query k-mers and reference k-mers.
When your FASTA or FASTQ files contain many low-complexity sequences, this match buffer fills up very quickly.
In such cases, Metabuli prints an overflow!!! message in the log.
It doesn't kill the process, but the resulting output files should be considered unreliable.

Obviously, Metabuli should be able to handle such cases, and we will improve this soon.
Until we close this issue, if you see the overflow!!! message in the log, please try again after filtering out low-complexity sequences.
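
One possible pre-filter until the fix lands (a rough sketch of the "filter out low-complexity sequences" step above, not an official recommendation): drop FASTQ reads whose Shannon entropy over base counts falls below a threshold. The threshold value and script name are placeholders; dedicated read-trimming tools with low-complexity filters work just as well.

import math, sys

def entropy(seq):
    # Shannon entropy over base frequencies (max 2 bits for ACGT)
    counts = {}
    for base in seq.upper():
        counts[base] = counts.get(base, 0) + 1
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values()) if n else 0.0

def filter_fastq(path_in, path_out, min_entropy=1.5):
    with open(path_in) as fin, open(path_out, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]   # FASTQ: 4 lines per read
            if not record[0]:
                break
            if entropy(record[1].strip()) >= min_entropy:
                fout.writelines(record)

if __name__ == "__main__":
    # usage: python filter_low_complexity.py reads.fastq filtered.fastq
    filter_fastq(sys.argv[1], sys.argv[2])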

Metabuli v1.0.0 segfault

Hi there,

Metabuli looks very interesting, but I'm encountering segmentation faults at two different stages (see below).
Interestingly, I'm seeing three different outcomes for the same input data, and even just changing the location
of the input file seems to be sufficient to arrive at a different result (one of: 1. it just works 2. segfault in metamer
extraction, 3. segfault in metamer comparison).

Metabuli v.1.0.0 compiled from source (-DCMAKE_BUILD_TYPE=Debug), with your pre-built database obtained as per
metabuli databases RefSeq refseq tmp.

/vol/mgx-sw/bin/metabuli classify --seq-mode 1 xxxy.in /vol/biodb/local_databases/MGX/metabuli/refseq/ outdir jobid --threads 1 --max-ram 64

The system is an Intel(R) Xeon(R) CPU E5-4627 with 1 TB of memory, running Ubuntu 20.04.6 LTS.

  1. Segfault during metamer extraction
Number of threads: 1
Query file: xxxy.in
Database directory: /vol/biodb/local_databases/MGX/metabuli/refseq/
Output directory: outdir
Job ID: jobid
Loading nodes file ... Done, got 2497728 nodes
Loading merged file ... Done, added 71172 merged nodes.
Loading names file ... Done
Init RMQ ...Done
The rest RAM: 68585259008
Indexing query file ...Done
Total number of sequences: 100000
Total read length: 16830253nt
Extracting query metamers ... 
Segmentation fault
(gdb) bt
#0  0x000055d9a8ce76ce in kseq_buffer_reader (inBuffer=0x7ffc8d6bf8c0, outBuffer=0x55d9b3b01f00 "p\343Ko\340\177", nbyte=16384)
    at /vol/mgx-sw/src/tools/Metabuli-1.0.0/lib/mmseqs/src/commons/KSeqBufferReader.h:30
#1  0x000055d9a8d22ac0 in ks_getc (ks=0x55d9aaf6d650) at /vol/mgx-sw/src/tools/Metabuli-1.0.0/src/commons/SeqIterator.h:25
#2  0x000055d9a8d231af in kseq_read (seq=0x55d9aaf6d5c0) at /vol/mgx-sw/src/tools/Metabuli-1.0.0/src/commons/SeqIterator.h:25
#3  0x000055d9a8d30cf9 in Classifier::_ZN10Classifier27fillQueryKmerBufferParallelER15QueryKmerBufferR10MmapedDataIcERKSt6vectorI13SequenceBlockSaIS6_EERS5_I5QuerySaISB_EERKSt4pairImmERK15LocalParameters._omp_fn.0(void) ()
    at /vol/mgx-sw/src/tools/Metabuli-1.0.0/src/commons/Classifier.cpp:354
#4  0x00007fe06f5158e6 in GOMP_parallel () from /lib/x86_64-linux-gnu/libgomp.so.1
#5  0x000055d9a8d25d43 in Classifier::fillQueryKmerBufferParallel (this=0x55d9aaf6c340, kmerBuffer=..., seqFile=..., 
    seqs=std::vector of length 100000, capacity 131072 = {...}, queryList=std::vector of length 100001, capacity 100001 = {...}, 
    currentSplit={...}, par=...) at /vol/mgx-sw/src/tools/Metabuli-1.0.0/src/commons/Classifier.cpp:345
#6  0x000055d9a8d25585 in Classifier::startClassify (this=0x55d9aaf6c340, par=...)
    at /vol/mgx-sw/src/tools/Metabuli-1.0.0/src/commons/Classifier.cpp:260
#7  0x000055d9a8dab1a7 in classify (argc=10, argv=0x7ffc8d6c1958, command=...)
    at /vol/mgx-sw/src/tools/Metabuli-1.0.0/src/workflow/classify.cpp:47
#8  0x000055d9a8db8ea4 in runCommand (p=0x55d9aaf571e0, argc=10, argv=0x7ffc8d6c1958)
    at /vol/mgx-sw/src/tools/Metabuli-1.0.0/lib/mmseqs/src/commons/Application.cpp:40
#9  0x000055d9a8db9ecc in main (argc=12, argv=0x7ffc8d6c1948) at /vol/mgx-sw/src/tools/Metabuli-1.0.0/lib/mmseqs/src/commons/Application.cpp:203
(gdb) p i
$1 = 0
(gdb) p inBuffer
$2 = (kseq_buffer_t *) 0x7ffc8d6bf8c0
(gdb) p index
$3 = 0
(gdb)
  2. Segfault during metamer comparison
Number of threads: 1
Query file: /vol/sge-tmp/metab_test/xxxy.in
Database directory: /vol/biodb/local_databases/MGX/metabuli/refseq/
Output directory: outdir
Job ID: jobid
Loading nodes file ... Done, got 2497728 nodes
Loading merged file ... Done, added 71172 merged nodes.
Loading names file ... Done
Init RMQ ...Done
The rest RAM: 68585259008
Indexing query file ...Done
Total number of sequences: 100000
Total read length: 16830253nt
Extracting query metamers ... 
Time spent for metamer extraction: 1
Sorting query metamer list ...
Time spent for sorting query metamer list: 6
Comparing qeury and reference metamers...
Segmentation fault (core dumped)
(gdb) bt
#0  Classifier::getNextTargetKmer (lookingTarget=589166591814214271, diffIdxBuffer=0x7f77f313b010, diffBufferIdx=@0x7ffdcc68b000: 80091129, 
    totalPos=@0x7ffdcc68b010: 25564673655) at /vol/mgx-sw/src/tools/Metabuli-1.0.0/src/commons/Classifier.h:394
#1  0x0000557e854dc587 in Classifier::_ZN10Classifier20linearSearchParallelEP9QueryKmerRmRNS_6BufferI5MatchEERK15LocalParameters._omp_fn.0(void)
    () at /vol/mgx-sw/src/tools/Metabuli-1.0.0/src/commons/Classifier.cpp:723
#2  0x00007f77fdcfe8e6 in GOMP_parallel () from /lib/x86_64-linux-gnu/libgomp.so.1
#3  0x0000557e854d0a35 in Classifier::linearSearchParallel (this=0x557e87dd4320, queryKmerList=0x7f77ad3a2010, 
    queryKmerCnt=@0x7ffdcc68cb38: 28860420, matchBuffer=..., par=...) at /vol/mgx-sw/src/tools/Metabuli-1.0.0/src/commons/Classifier.cpp:570
#4  0x0000557e854cf72f in Classifier::startClassify (this=0x557e87dd4320, par=...)
    at /vol/mgx-sw/src/tools/Metabuli-1.0.0/src/commons/Classifier.cpp:281
#5  0x0000557e855551a7 in classify (argc=10, argv=0x7ffdcc68d318, command=...)
    at /vol/mgx-sw/src/tools/Metabuli-1.0.0/src/workflow/classify.cpp:47
#6  0x0000557e85562ea4 in runCommand (p=0x557e87dbf1e0, argc=10, argv=0x7ffdcc68d318)
    at /vol/mgx-sw/src/tools/Metabuli-1.0.0/lib/mmseqs/src/commons/Application.cpp:40
#7  0x0000557e85563ecc in main (argc=12, argv=0x7ffdcc68d308) at /vol/mgx-sw/src/tools/Metabuli-1.0.0/lib/mmseqs/src/commons/Application.cpp:203
(gdb)

I'm attaching the sample input file I've been using for this.
input.fas.txt

Bus error (core dumped)

Hello,

I have built a custom database, and during the analysis step, I get:

(...)
Extracting query metamers ...
Time spent for metamer extraction: 52
Sorting query metamer list ...
Time spent for sorting query metamer list: 293
Comparing query and reference metamers...
Bus error (core dumped)

I wonder if my custom reference is to blame, or my reads, or something else. The DB files look like this:

-rw-r--r-- 1 user group 3879768762 Nov 16 15:44 0_diffIdx
-rw-r--r-- 1 user group 3754395992 Nov 16 15:44 0_info
(...)
-rw-r--r-- 1 user group 3925091364 Nov 16 16:12 7_diffIdx
-rw-r--r-- 1 user group 3800137728 Nov 16 16:12 7_info
-rw-r--r-- 1 user group 3917471188 Nov 16 16:16 8_diffIdx
-rw-r--r-- 1 user group 3782011244 Nov 16 16:16 8_info
-rw-r--r-- 1 user group 3909695158 Nov 16 16:21 9_diffIdx
-rw-r--r-- 1 user group 3783195404 Nov 16 16:21 9_info
-rw-r--r-- 1 user group    1596492 Nov 16 15:40 acc2taxid.map
-rw-r--r-- 1 user group          0 Nov 16 21:03 diffIdx
-rw-r--r-- 1 user group          0 Nov 16 21:03 info
-rw-r--r-- 1 user group          0 Nov 16 21:03 split
-rw-r--r-- 1 user group   63045551 Nov 16 15:40 taxID_list
drwxr-xr-x 2 user group         98 Nov 16 15:26 taxonomy

There are these files of size 0; is this expected?

Thanks for the help

Mathieu

Feature request: support non-GC[AF] accessions

Hi! I went to try out Metabuli (v1.0.4) with a custom database (thanks for the great docs!) but hit an issue:

During processing /test/test.fasta, accession testaccession is not found in the mapping file. It is skipped.

Inspecting the code, it appears that Metabuli requires accessions to follow the NCBI convention with a GCF_* or GCA_* prefix (https://github.com/steineggerlab/Metabuli/blob/1.0.4/src/workflow/add_to_library.cpp#L156-L157). If that's correct, would it be possible to add support for non-NCBI-formatted accessions? I believe many other metagenomic tools have this capability, which is helpful for custom databases.

Thanks for your consideration!
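
For illustration, the constraint described above amounts to accepting only NCBI assembly-style accessions; the exact check in add_to_library.cpp may differ, so treat the pattern below as an approximation.

import re

# Approximate NCBI assembly accession pattern: GCA_/GCF_ plus digits and a version.
ASSEMBLY_ACC = re.compile(r"^GC[AF]_\d+\.\d+")

for acc in ("GCF_000002945.1", "testaccession"):
    print(acc, "->", "accepted" if ASSEMBLY_ACC.match(acc) else "skipped")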

More segmentation faults

Hi there,

As with #10, I am experiencing segmentation faults at the "Extracting query metamers ..." stage.

I am getting these errors whether I build from source or use a pre-compiled binary (details below). To build from source, I ran:

git clone https://github.com/steineggerlab/Metabuli.git 
module load gcc cmake   # on a Digital Research Alliance of Canada cluster; these modules are needed to build
cd Metabuli
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release .. -DCMAKE_INSTALL_PREFIX=/home/sdwork/software/metabuli
make -j4
make install

I ran the command:

metabuli classify OCH16_1_val_1.fq OCH16_2_val_2.fq /home/sdwork/scratch/metagenomics/gtdb fq_och16 fq_och16 --threads 32

When I tried looking at the core dump with gdb I saw:

Program terminated with signal SIGSEGV, Segmentation fault. #0 0x00000000004568d3 in SeqIterator::fillQueryKmerBuffer(char const*, int, QueryKmerBuffer&, unsigned long&, unsigned int, unsigned int) ()

I tried just using a pre-compiled binary on the cluster and saw the same error.

I tried downloading/installing with conda on one of our local machines and I encountered the exact same problem. I tried changing the permissions as suggested in #10 and I see the same issues. I downloaded the GTDB database locally with:

metabuli databases GTDB207 gtdb tmp

and am trying to run the command:

metabuli classify OCH16_1.fq OCH16_2.fq gtdb och16_out och16 --threads 14 --max-ram 50

The output I see is:

Number of threads: 14
Query file 1: OCH16_1.fq
Query file 2: OCH16_2.fq
Database directory: gtdb
Output directory: och16_out
Job ID: och16
Loading nodes file ... Done, got 406311 nodes
Loading merged file ... Done, added 0 merged nodes.
Loading names file ... Done
Init RMQ ...Done
The rest RAM: 51808043008
Indexing query file ...Done
Total number of sequences: 75074832
Total read length: 22393292176nt
Extracting query metamers ...
Segmentation fault (core dumped)

Profiling eukaryote contigs?

This looks like a great tool.

I'm wondering, though, how well Metabuli would perform at classifying environmental microeukaryotes, particularly because it looks like Prodigal is used to generate the databases, which is not ideal for eukaryotic gene prediction. Would Metabuli outperform MMseqs2 Taxonomy with the nr database for assigning eukaryotic taxonomy to contigs? If so, which Metabuli database would be best suited?

Thanks!

Database with single pan-genome

I'm interested in performing ultra-fast classification of metagenomic reads against a database that contains a single pangenome (thousands of genomes of the same species). My goal is to identify the subset of reads in a metagenome in FASTQ format that might come from my genome of interest. I care a lot more about sensitivity than specificity. Would Metabuli be a good choice for this problem?

Inconsistency of read assignments

Hi @JaebeomKim0731 and @martin-steinegger,

I noticed something unexpected: read classifications differ between runs depending on the --threads parameter.

metabuli classify --seq-mode 1 testread.fas /vol/biodb/local_databases/MGX/metabuli/refseq/ outdir10 jobid --threads 10

metabuli classify --seq-mode 1 testread.fas /vol/biodb/local_databases/MGX/metabuli/refseq/ outdir20 jobid --threads 20
testread.fas.txt

$ cat outdir10/jobid_classifications.tsv outdir20/jobid_classifications.tsv 
1       testseq 1501268 183     0.606557        0.622951        2       species 1501268:18 
1       testseq 1219    183     0.631148        0.672131        5       species 74546:1 167546:2 1219:24

This seems to happen quite frequently; e.g. for a test dataset with 100,000 sequences, I see over 4,000 differences in the output. Can you try to reproduce this?
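
For reproduction, a small sketch of the comparison described above; it assumes the second column of *_classifications.tsv is the read ID and the third the assigned taxID, as in the cat output.

import csv

def assignments(path):
    # read ID -> assigned taxID
    with open(path) as f:
        return {row[1]: row[2] for row in csv.reader(f, delimiter="\t")}

a = assignments("outdir10/jobid_classifications.tsv")
b = assignments("outdir20/jobid_classifications.tsv")
diff = [r for r in a if a[r] != b.get(r)]
print(len(diff), "reads with differing assignments")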

Metabuli produces very different results from mmseqs on longread contigs

Dear authors,

Thank you for your work! The tool looks great, and our preliminary testing showed fantastic results for short reads.

We are interested in contig-level annotations, but in the paper Metabuli is described as a read-level tool. Do you see any problems with using it for contigs?

We tested it on the CAMI short-read contigs and got amazing species annotations for 99% of contigs. However, after running it on real long-read data, we are seeing major differences (~25% of the annotated part of the dataset) compared with how MMseqs2 annotates the same dataset, starting already at the phylum level. Do you know what the problem might be? Does it make sense to use the tool on long-read contigs?

Thank you,
Svetlana

Metabuli for eukaryotes: use cases and future plans.

The Metabuli project started out targeting prokaryotes and viruses.
However, since we are hearing about eukaryotic use cases and some promising performance on the user side,
we are planning to optimize default settings or add parameters for eukaryotes.
Providing a pre-built database covering both eukaryotes and prokaryotes is also on the to-do list.

Here are some cases of Metabuli with eukaryotes.

  1. Environmental DNA metabarcoding for surveying marine vertebrates (benchmarks)
    Metabuli showed promising performance in classifying simulated 12S and 16S amplicon data of marine vertebrates.
    Working parameters: --seq-mode 1 --min-cons-cnt-euk 4 --tie-ratio 0.99

  2. Testing Metabuli for fungi
    With --min-cons-cnt-euk 4, Metabuli correctly classified 97% of paired-end reads simulated from a fungal species whose genome is included in the DB.
    But the percentage dropped to 12% with the default setting (--min-cons-cnt 9).

For now, --min-cons-cnt-euk is thought to be a critical parameter.
It determines the minimum number of consecutive k-mer hits required for a read to be classified.
The strict default value of --min-cons-cnt-euk 9 was chosen in an older version of Metabuli as a quick remedy to reduce false-positive eukaryote hits caused by their larger genomes.
Even though we have since added noise-filtering steps to reduce false positives, we haven't re-tuned the value for eukaryotes.
Based on user reports, setting --min-cons-cnt-euk to a lower value such as 4 or 5 should work well for now.
After some tests, we will make a new release with an optimized default value.
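
For illustration, with placeholder file and directory names, the amplicon settings above would be passed like this (the paired-end runs elsewhere on this page simply pass two FASTQ files and omit --seq-mode 1):

metabuli classify --seq-mode 1 amplicons.fa DBDIR OUTDIR jobid --min-cons-cnt-euk 4 --tie-ratio 0.99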

+++
Please share your thoughts on how and what to optimize in Metabuli for eukaryotes!
It would help us a lot in making Metabuli more useful for your research.

add-to-library for custom db - std::logic_error

Hello,

I am trying to build a custom db for 98 NCBI RefSeq fungal genomes and am running into the following error. I'm not sure if this is a bug or if there is something wrong with the structure of my input files:

command (metabuli v1.0.1):
metabuli add-to-library fasta_list.txt accession2taxid metabuli_refseq_fungi_db

output:

Loading nodes file ... Done, got 2550799 nodes
Loading merged file ... Done, added 75762 merged nodes.
Loading names file ... Done
Init RMQ ...Done
Load mapping from accession ID to taxonomy ID
done
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string: construction from null is not valid

head accession2taxid:

accession	accession.version	taxid	species_taxid
GCF_000002945	GCF_000002945.1	4896	4896
GCF_003054445	GCF_003054445.1	4909	4909
GCF_000243375	GCF_000243375.1	4950	4950
GCF_000026365	GCF_000026365.1	4956	4956

head fasta_list.txt:

/path/metabuli_refseq_fungi_db/library/GCF_000219625.1_MYCGR_v2.0_genomic.fna.gz
/path/metabuli_refseq_fungi_db/library/GCF_014133895.1_ASM1413389v1_genomic.fna.gz
/path/metabuli_refseq_fungi_db/library/GCF_019915245.1_ASM1991524v1_genomic.fna.gz
/path/metabuli_refseq_fungi_db/library/GCF_000027005.1_ASM2700v1_genomic.fna.gz

In my DBDIR (metabuli_refseq_fungi_db) I have a taxonomy folder containing the unpacked files from the NCBI taxdump, and a library folder containing the gzipped FASTA sequences. The README and the --help page for add-to-library are not clear to me on whether the FASTA files should already be stored in DBDIR/library before running add-to-library or whether they can be stored elsewhere, but I have tried both and get the same error.

Appreciate any help you can provide, thanks!
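
A hypothetical pre-flight check, not a confirmed cause of the std::logic_error: verify that every path in fasta_list.txt exists and that a GCF_/GCA_ accession parsed from each file name has an entry in accession2taxid. Whether add-to-library actually matches on the file name is itself an assumption here.

import os, re

mapped = set()
with open("accession2taxid") as f:
    next(f)                                   # skip the header line
    for line in f:
        acc, acc_ver, *_ = line.rstrip("\n").split("\t")
        mapped.update((acc, acc_ver))

with open("fasta_list.txt") as f:
    for path in (l.strip() for l in f if l.strip()):
        if not os.path.isfile(path):
            print("missing file:", path)
            continue
        m = re.search(r"GC[AF]_\d+\.\d+", os.path.basename(path))
        if not m or m.group(0) not in mapped:
            print("no mapping for:", path)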

metabuli-inspect?

Hi again @JaebeomKim0731,

would it be possible (similar to kraken2-inspect) to provide a way to list the
taxa contained in a Metabuli database?

I'm currently playing around with assembled and binned metagenomic contigs
and different ways to taxonomically assign them; in cases where I obtain different
classifications from different approaches, it would be great to be able to check if
a certain taxon is included in a Metabuli database.

What do you think?

Sebastian
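
In the meantime, a rough stand-in for the requested inspect command, with a loud caveat: it assumes (unverified) that DBDIR/taxID_list is a plain-text file with one numeric taxon ID per line, and it resolves names via DBDIR/taxonomy/names.dmp. If the on-disk format differs, this will not work as-is.

# Map each taxon ID listed in the database to its scientific name.
names = {}
with open("DBDIR/taxonomy/names.dmp") as f:
    for line in f:
        fields = [x.strip() for x in line.split("|")]
        if fields[3] == "scientific name":
            names[fields[0]] = fields[1]

with open("DBDIR/taxID_list") as f:
    taxids = sorted({line.strip() for line in f if line.strip()})

for taxid in taxids:
    print(taxid, names.get(taxid, "unknown"))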
