2017-sourmash-lca's Issues

Zero LCA classifications when many expected

Zero LCA classifications were found when many were expected.

I want to compare the performance of Kraken/Bracken with sourmash lca, so the databases need to have the same content. @luizirber downloaded genomes and built a Kraken database from them. He posted this database and its accompanying files on Google Drive, and I downloaded them using rclone. My goal was to build the LCA database from the genomes in the Kraken database, and then use that database to classify the podar reads.

I ran the following commands from a blank Ubuntu Xenial 16.04 AWS instance.

Install rclone

sudo apt-get update && sudo apt-get -y install unzip
wget https://downloads.rclone.org/rclone-v1.38-linux-amd64.zip
unzip rclone-v1.38-linux-amd64.zip
cd rclone-*-linux-amd64
sudo cp rclone /usr/bin/
sudo chown root:root /usr/bin/rclone
sudo chmod 755 /usr/bin/rclone

Configure rclone for Google Drive, then download the Kraken database.

rclone sync remote:kraken_db .

Install sourmash

sudo apt-get -y update && \
sudo apt-get install -y python3.5-dev python3.5-venv make \
    libc6-dev g++ zlib1g-dev

python3.5 -m venv ~/sourmash20171007
. ~/sourmash20171007/bin/activate
pip install -U pip
pip install -U Cython
pip install -U jupyter jupyter_client ipython pandas matplotlib scipy scikit-learn khmer

pip install -U https://github.com/dib-lab/sourmash/archive/master.zip

Install sourmash lca functionality

git clone https://github.com/ctb/2017-sourmash-lca.git

Calculate signatures of refseq files
Kraken's default k-mer size is 31. When computing signatures, use k = 21, 31, and 51.

cd ~/sourmash_sigs
for infile in library/*/*/*.fna
do
  sourmash compute -k 21,31,51 --output "${infile}.sig" --scaled 10000 --name-from-first "${infile}"
done
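
Before building the database, it may be worth confirming that the loop above produced a signature for every genome. The following is a minimal sketch, assuming each signature sits next to its FASTA file as <file>.fna.sig (as the --output argument above implies). The helper name missing_sigs is hypothetical and not part of sourmash.

```shell
# Hypothetical helper: print every .fna file under a directory that has
# no non-empty "<file>.sig" next to it.
missing_sigs() {
  dir="$1"
  find "$dir" -type f -name '*.fna' | sort | while IFS= read -r fna; do
    # -s is true only if the file exists and is larger than zero bytes
    [ -s "${fna}.sig" ] || printf '%s\n' "$fna"
  done
}
```

Running missing_sigs library from ~/sourmash_sigs should print nothing if every genome has a signature. Any path it prints is a genome that would be silently absent from the LCA database.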

Prepare the database

mkdir refseq_lca
cd refseq_lca
mkdir genbank
cd genbank
curl -O -L ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xzf taxdump.tar.gz nodes.dmp names.dmp
cd ..

curl -O -L https://github.com/dib-lab/2017-ncbi-taxdump/raw/master/genbank-genomes-accession%2Blineage-20170529.csv.gz

~/2017-sourmash-lca/extract.py sbt.lca genbank*.csv.gz genbank/nodes.dmp ~/sourmash_sigs/library/*/*/*.sig --lca-json=sbt.lca.json

Grab Podar reads

scp -i ~/path/to/key ~/Desktop/mircea/SRR606249.pe.qc.fq.gz.abundtrim.gz [email protected]:~/

Compute sourmash signatures for podar reads

sourmash compute -k 21,31,51 --scaled 10000 -o SRR606249.pe.qc.fq.gz.abundtrim.sig SRR606249.pe.qc.fq.gz.abundtrim.gz

Classify reads

~/2017-sourmash-lca/classify.py -k 31 ~/refseq_lca/sbt.lca.json ~/SRR606249.pe.qc.fq.gz.abundtrim.sig

This did not work and output the following message:

loading taxonomic nodes from: /home/ubuntu/refseq_lca/genbank/nodes.dmp
loading taxonomic names from: /home/ubuntu/refseq_lca/genbank/names.dmp
loading k-mer DB from: /home/ubuntu/refseq_lca/sbt.lca
loading signatures from 1 signature files
loaded 1 signatures total at k=31
downsampling to scaled value: 10000
found LCA classifications for 0 of 6228 hashes
cannot find taxid 0; quitting.
percent	below	at node	code	taxid	name
100.0	6228	6228	-	0	-
100.0	6228	6228	U	0	not classified

I then tried using genbank as in the README.md example, and this worked fine.

mkdir ~/genbank_lca
cd ~/genbank_lca

curl -L "https://osf.io/zfmbd/download?version=1" -o genbank-lca-2017.08.26.tar.gz
tar xzf genbank-lca-2017.08.26.tar.gz

Classify reads with genbank

~/2017-sourmash-lca/classify.py ~/genbank_lca/genbank.lca.json ~/SRR606249.pe.qc.fq.gz.abundtrim.sig

I'm not sure what the problem is.
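
One thing that could be checked is whether the sequence names in the Kraken library files match the accessions in the lineage CSV, since the signatures were named with --name-from-first. The sketch below rests on assumptions that are not confirmed here: that column 1 of the CSV holds the accession, and that matching happens on the first whitespace-delimited token of the first FASTA header. The function name count_known_accessions is hypothetical, and how extract.py actually looks up lineages has not been verified.

```shell
# Hypothetical diagnostic: count how many first-header IDs from the given
# FASTA files appear in column 1 of a gzipped accession CSV.
# ASSUMPTIONS: column 1 is the accession; the ID is the first
# whitespace-delimited token of the first header line, without ">".
count_known_accessions() {
  csv_gz="$1"; shift
  total=0
  found=0
  accs=$(mktemp)
  # Collect the unique values of column 1 into a temporary lookup file
  gzip -dc "$csv_gz" | cut -d, -f1 | sort -u > "$accs"
  for fna in "$@"; do
    id=$(head -n 1 "$fna" | sed -e 's/^>//' -e 's/[[:space:]].*$//')
    total=$((total + 1))
    # -x: whole-line match, -F: fixed string, -q: quiet
    if grep -qxF -- "$id" "$accs"; then
      found=$((found + 1))
    fi
  done
  rm -f "$accs"
  echo "$found of $total first-header IDs found"
}
```

For example, count_known_accessions ~/refseq_lca/genbank*.csv.gz ~/sourmash_sigs/library/*/*/*.fna would report the overlap for the files used above. A count near zero would be consistent with a naming mismatch between the two sources, though it would not prove that this is the cause.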
