2017-sourmash-lca's Issues
Zero LCA classifications when many expected
Zero LCA classifications were found when many were expected.
I want to compare the performance of Kraken/Bracken with sourmash lca, so the databases need to have the same content. @luizirber downloaded genomes and built a Kraken database from them. He posted this database and accompanying files on Google Drive, and I downloaded them with rclone. My goal was to build the LCA database from the genomes in the Kraken database, and then use that database on the Podar reads.
I ran the following commands from a blank Ubuntu Xenial 16.04 AWS instance.
Install rclone
sudo apt-get update && sudo apt-get -y install unzip
wget https://downloads.rclone.org/rclone-v1.38-linux-amd64.zip
unzip rclone-v1.38-linux-amd64.zip
cd rclone-*-linux-amd64
sudo cp rclone /usr/bin/
sudo chown root:root /usr/bin/rclone
sudo chmod 755 /usr/bin/rclone
Configure rclone for Google Drive and sync the database
rclone sync remote:kraken_db .
Install sourmash
sudo apt-get -y update && \
sudo apt-get install -y python3.5-dev python3.5-venv make \
libc6-dev g++ zlib1g-dev
python3.5 -m venv ~/sourmash20171007
. ~/sourmash20171007/bin/activate
pip install -U pip
pip install -U Cython
pip install -U jupyter jupyter_client ipython pandas matplotlib scipy scikit-learn khmer
pip install -U https://github.com/dib-lab/sourmash/archive/master.zip
Install sourmash lca functionality
git clone https://github.com/ctb/2017-sourmash-lca.git
Calculate signatures of refseq files
The default Kraken database uses k=31. When computing signatures, use k=21, 31, and 51.
cd ~/sourmash_sigs
for infile in library/*/*/*.fna
do
    sourmash compute -k 21,31,51 --output "${infile}.sig" --scaled 10000 --name-from-first "${infile}"
done
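Before building the database, it may be worth confirming that the loop above produced a signature for every genome. The following is a minimal sketch using only the Python standard library; the function name `missing_signatures` is hypothetical, and the directory layout (`library/*/*/*.fna` with the `.sig` written next to each FASTA file) is taken from the loop above.

```python
from pathlib import Path


def missing_signatures(root, pattern="library/*/*/*.fna"):
    """Return FASTA paths under root that lack a non-empty '<fasta>.sig' file."""
    missing = []
    for fasta in sorted(Path(root).glob(pattern)):
        # The loop above writes the signature to "<fasta path>.sig".
        sig = Path(str(fasta) + ".sig")
        if not sig.is_file() or sig.stat().st_size == 0:
            missing.append(fasta)
    return missing


if __name__ == "__main__":
    # Assumption: signatures were computed from ~/sourmash_sigs, as above.
    for path in missing_signatures(Path.home() / "sourmash_sigs"):
        print("no signature for", path)
```

If this prints nothing, every `.fna` file has a non-empty `.sig` beside it; it does not check that the signature contents are valid.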
Prepare the database
mkdir refseq_lca
cd refseq_lca
mkdir genbank
cd genbank
curl -O -L ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xzf taxdump.tar.gz nodes.dmp names.dmp
cd ..
curl -O -L https://github.com/dib-lab/2017-ncbi-taxdump/raw/master/genbank-genomes-accession%2Blineage-20170529.csv.gz
~/2017-sourmash-lca/extract.py sbt.lca genbank*.csv.gz genbank/nodes.dmp ~/sourmash_sigs/library/*/*/*.sig --lca-json=sbt.lca.json
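For context on what the database encodes: a lowest common ancestor (LCA) is the deepest taxonomic node shared by the lineages of a set of taxids, with parent links coming from nodes.dmp. The sketch below illustrates the idea on the standard NCBI taxdump layout (fields separated by tab, pipe, tab; the root, taxid 1, is its own parent). The function names are illustrative and are not taken from extract.py or classify.py.

```python
def load_parents(nodes_path):
    """Map each taxid to its parent taxid, reading an NCBI nodes.dmp file."""
    parents = {}
    with open(nodes_path) as fh:
        for line in fh:
            # nodes.dmp rows look like: "taxid\t|\tparent\t|\trank\t|\t..."
            fields = line.split("\t|\t")
            parents[int(fields[0])] = int(fields[1])
    return parents


def lineage(taxid, parents):
    """Return the path from taxid up to the root, deepest node first."""
    path = [taxid]
    # The root is the only node that is its own parent.
    while parents[taxid] != taxid:
        taxid = parents[taxid]
        path.append(taxid)
    return path


def lowest_common_ancestor(taxids, parents):
    """Return the deepest taxid present in the lineage of every given taxid."""
    taxids = list(taxids)
    common = lineage(taxids[0], parents)
    for taxid in taxids[1:]:
        seen = set(lineage(taxid, parents))
        # Keep the deepest-first order while dropping unshared nodes.
        common = [t for t in common if t in seen]
    return common[0]
```

A taxid that is absent from nodes.dmp raises `KeyError` in this sketch, so it can also be used to check whether a particular taxid is known to the taxonomy dump.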
Grab Podar reads
scp -i ~/path/to/key ~/Desktop/mircea/SRR606249.pe.qc.fq.gz.abundtrim.gz [email protected]:~/
Compute sourmash signatures for podar reads
sourmash compute -k 21,31,51 --scaled 10000 -o SRR606249.pe.qc.fq.gz.abundtrim.sig SRR606249.pe.qc.fq.gz.abundtrim.gz
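To confirm that the query signature contains a sketch for each requested k-mer size, the .sig file can be inspected directly. This sketch assumes the signature is plain JSON: a list of records, each with a `signatures` list whose entries carry `ksize`, `mins`, and `max_hash`. That layout is an assumption about the file format at the time, and `summarize_signature` is a hypothetical helper, not part of sourmash.

```python
import json


def summarize_signature(sig_path):
    """Return (name, ksize, hash count, max_hash) for each sketch in a .sig file."""
    with open(sig_path) as fh:
        records = json.load(fh)
    rows = []
    for record in records:
        # Fall back to the source filename when no name was stored.
        name = record.get("name") or record.get("filename", "")
        for sketch in record.get("signatures", []):
            rows.append(
                (name, sketch["ksize"], len(sketch["mins"]), sketch.get("max_hash", 0))
            )
    return rows


if __name__ == "__main__":
    import sys

    # Usage: python summarize_sig.py SRR606249.pe.qc.fq.gz.abundtrim.sig
    for path in sys.argv[1:]:
        for name, ksize, count, max_hash in summarize_signature(path):
            print(f"{name}\tk={ksize}\thashes={count}\tmax_hash={max_hash}")
```

For the command above, one row each for k=21, 31, and 51 would be expected, all with the same non-zero `max_hash` because they share `--scaled 10000`.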
Classify reads
~/2017-sourmash-lca/classify.py -k 31 ~/refseq_lca/sbt.lca.json ~/SRR606249.pe.qc.fq.gz.abundtrim.sig
This did not work; it printed the following output:
loading taxonomic nodes from: /home/ubuntu/refseq_lca/genbank/nodes.dmp
loading taxonomic names from: /home/ubuntu/refseq_lca/genbank/names.dmp
loading k-mer DB from: /home/ubuntu/refseq_lca/sbt.lca
loading signatures from 1 signature files
loaded 1 signatures total at k=31
downsampling to scaled value: 10000
found LCA classifications for 0 of 6228 hashes
cannot find taxid 0; quitting.
percent  below  at node  code  taxid  name
100.0    6228   6228     -     0      -
100.0    6228   6228     U     0      not classified
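The line "found LCA classifications for 0 of 6228 hashes" suggests that none of the query's k=31 hashes were present in the database. One way to narrow this down might be to compare the query signature directly against a single reference signature that should be present in the sample. The sketch below counts shared hashes at one k-mer size; it assumes the plain-JSON signature layout (`signatures` entries with `ksize`, `mins`, `max_hash`), and the helper names are hypothetical rather than part of sourmash or 2017-sourmash-lca.

```python
import json


def load_hashes(sig_path, ksize):
    """Return (set of hashes, max_hash) for the first sketch with the given ksize."""
    with open(sig_path) as fh:
        records = json.load(fh)
    for record in records:
        for sketch in record.get("signatures", []):
            if sketch["ksize"] == ksize:
                return set(sketch["mins"]), sketch.get("max_hash", 0)
    raise ValueError(f"no k={ksize} sketch in {sig_path}")


def shared_hashes(query_path, ref_path, ksize=31):
    """Return (number of shared hashes, number of query hashes) at a common cutoff."""
    query, query_max = load_hashes(query_path, ksize)
    ref, ref_max = load_hashes(ref_path, ksize)
    # With --scaled N, only hashes up to roughly 2**64 / N are kept, so two
    # sketches are comparable once both are cut at the smaller max_hash.
    cutoffs = [m for m in (query_max, ref_max) if m]
    if cutoffs:
        limit = min(cutoffs)
        query = {h for h in query if h <= limit}
        ref = {h for h in ref if h <= limit}
    return len(query & ref), len(query)
```

Calling `shared_hashes` on the Podar reads signature and one of the `.fna.sig` files under `~/sourmash_sigs/library` gives a rough indication: a non-zero overlap would point toward the database-building step rather than the signatures themselves, while zero overlap against genomes known to be in the mock community would point toward the signatures.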
I then tried using genbank as in the README.md example, and this worked fine.
mkdir ~/genbank_lca
cd ~/genbank_lca
curl -L "https://osf.io/zfmbd/download?version=1" -o genbank-lca-2017.08.26.tar.gz
tar xzf genbank-lca-2017.08.26.tar.gz
Classify reads with genbank
~/2017-sourmash-lca/classify.py ~/genbank_lca/genbank.lca.json SRR606249.pe.qc.fq.gz.abundtrim.sig
I'm not sure what the problem is.