sourmash-bio / databases
Build sourmash databases for genbank.
What checks should be executed when a new database is built?
Pointers:
NCBI is releasing the RefSeq representative genomes, similar to how GTDB has their dereplicated genomes.
https://ncbiinsights.ncbi.nlm.nih.gov/2020/08/21/updated-representative-genomes/
Might be worth building an equivalent sourmash database, and versioning it as NCBI releases new versions?
for genbank, we do --name-from-first, so we get output like this:
CP001941.1 Aciduliprofundum boonei T4...
for gtdb, we do trickier name setting, so we get output like this:
GCF_000025665 s__Aciduliprofundum boonei
with the main difference here being that the GCF_ identifier points to the identifier for the whole genome, not just the first sequence. That seems better.
We could add an optional identifier string to signatures. Hrm. Ref sourmash-bio/sourmash#268 for more such questions.
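To make the difference concrete, here is a minimal sketch that treats the first whitespace-delimited token of a signature name as its identifier. The helper names `split_name` and `is_assembly_accession` are made up for illustration and are not part of sourmash; the two names are the examples quoted above.

```python
def split_name(name):
    """Split a signature name into (identifier, description) at the first space."""
    ident, _, description = name.partition(" ")
    return ident, description


def is_assembly_accession(ident):
    """True if the identifier looks like an assembly accession (GCF_/GCA_ prefix)."""
    return ident.startswith(("GCF_", "GCA_"))


genbank_style = "CP001941.1 Aciduliprofundum boonei T4..."   # from --name-from-first
gtdb_style = "GCF_000025665 s__Aciduliprofundum boonei"      # from the gtdb naming

for name in (genbank_style, gtdb_style):
    ident, description = split_name(name)
    print(ident, is_assembly_accession(ident), description, sep="\t")
```

With the genbank-style name the identifier is a sequence accession for the first record only, while the gtdb-style name carries an identifier for the whole assembly.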
ref #7
With sourmash 4.x, we can build SBTs from a file containing a list of signatures. This seems to circumvent long DAG solve times for snakemake w/many files. Working well for gtdb r95 (~31k genomes; see /group/ctbrowngrp/gtdb-r95/ on farm) via the code below.
localrules: signames_to_file

# write the paths of all signatures to one text file, one path per line
rule signames_to_file:
    input: expand(os.path.join(out_dir, "signatures", "{sample}.sig"), sample=sample_names),
    output: os.path.join(out_dir, "index", "{basename}.signatures.txt")
    run:
        with open(str(output), "w") as outF:
            for inF in input:
                outF.write(str(inF) + "\n")

# build a zipped SBT from the signature list instead of from many input files
rule index_sbt:
    input: os.path.join(out_dir, "index", "{basename}.signatures.txt")
    output: os.path.join(out_dir, "index", "{basename}.{alphabet}-k{ksize}-scaled{scaled}.sbt.zip"),
    threads: 1
    params:
        alpha_cmd = lambda w: alphabet_info[w.alphabet]["alpha_cmd"],
        ksize = lambda w: int(w.ksize) * int(alphabet_info[w.alphabet]["ksize_multiplier"]),
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 50000,
        runtime=6000,
    log: os.path.join(logs_dir, "index", "{basename}.{alphabet}-k{ksize}-scaled{scaled}.index-sbt.log")
    benchmark: os.path.join(benchmarks_dir, "index", "{basename}.{alphabet}-k{ksize}-scaled{scaled}.index-sbt.benchmark")
    conda: "envs/sourmash-dev.yml"
    shell:
        """
        sourmash index {output} --ksize {params.ksize} \
          --scaled {wildcards.scaled} {params.alpha_cmd} \
          --from-file {input} 2> {log}
        """
from: https://github.com/bluegenes/thumper/blob/master/thumper/index.snakefile
for sketching rules, see https://github.com/bluegenes/thumper/blob/master/thumper/thumper.snakefile
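For anyone who wants to try the same approach outside snakemake, here is a rough plain-Python sketch of the two steps: write the signature paths to a list file, then assemble the `sourmash index ... --from-file` command. The `alphabet_info` dict below is a hypothetical stand-in (the real one lives in thumper and is not shown in this thread), and whether a ksize multiplier other than 1 is needed depends on the sourmash version, so treat those values as assumptions.

```python
import os
import tempfile

# Hypothetical stand-in for thumper's alphabet_info; the values are assumptions.
alphabet_info = {
    "nucleotide": {"alpha_cmd": "--dna", "ksize_multiplier": 1},
    "protein": {"alpha_cmd": "--protein", "ksize_multiplier": 3},
}


def write_siglist(sig_paths, list_path):
    """Write one signature path per line, like the signames_to_file rule."""
    with open(list_path, "w") as out:
        for path in sig_paths:
            out.write(str(path) + "\n")


def index_command(output, siglist, alphabet, ksize, scaled):
    """Assemble the same command line that the index_sbt rule runs."""
    info = alphabet_info[alphabet]
    k = int(ksize) * int(info["ksize_multiplier"])
    return [
        "sourmash", "index", output,
        "--ksize", str(k),
        "--scaled", str(scaled),
        info["alpha_cmd"],
        "--from-file", siglist,
    ]


with tempfile.TemporaryDirectory() as tmp:
    siglist = os.path.join(tmp, "demo.signatures.txt")
    write_siglist(["a.sig", "b.sig"], siglist)
    cmd = index_command("demo.protein-k10-scaled100.sbt.zip", siglist, "protein", 10, 100)
    # Printed only; pass cmd to subprocess.run to actually build the index.
    print(" ".join(cmd))
```

The point of the list file is that snakemake only has to track one input for the indexing rule, no matter how many signatures there are.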
hi @ctb,
I think the "real" GTDB has about 150k genomes, and 25k genomes sounds like a dereplicated set used by the GTDB toolkit, or am I missing something here? Maybe the 25k genomes correspond to this subset by the same group around Donovan Parks?
kind regards,
Adrian
related to sourmash-bio/sourmash#970
Each subset of RefSeq and GenBank has an assembly_summary.txt file.
This is from fungi: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/assembly_summary.txt
All refseq subsets: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/
Benefits of using assembly_summary.txt:
- Names can be built from assembly_accession, organism_name, infraspecific_name and asm_name. For example, for GCF_001477545.1, the name could be GCF_001477545.1 Pneumocystis carinii B80 strain=B80, Pneu_cari_B80_V3
- The taxid field can be used to generate TaxInfo and save it in the Zipped SBT during indexing. Because we control both the name (instead of using --name-from-first) and how it is saved in the TaxInfo, scripts for converting results like gather_to_opal.py can be simplified.
- ... gather or search)

More info: https://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf
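As a rough sketch of the naming idea, the snippet below reads an assembly_summary.txt-style table and builds a name from assembly_accession, organism_name, infraspecific_name and asm_name. It assumes the file is tab-separated and that the last `#` line before the data holds the column names. The sample row uses a reduced set of columns, and its taxid is a placeholder, not the real value. `read_assembly_summary` and `make_name` are names invented for this example.

```python
import io


def read_assembly_summary(handle):
    """Yield one dict per data row; the last '#' line is taken as the header."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#"):
            header = line.lstrip("#").strip().split("\t")
            continue
        if not line or header is None:
            continue
        yield dict(zip(header, line.split("\t")))


def make_name(row):
    """Build 'accession organism infraspecific_name, asm_name' from one row."""
    name = f"{row['assembly_accession']} {row['organism_name']}"
    infra = row.get("infraspecific_name", "")
    if infra and infra != "na":
        name += f" {infra}"
    asm_name = row.get("asm_name", "")
    if asm_name:
        name += f", {asm_name}"
    return name


# Reduced sample; the taxid "0" is a placeholder, not the real taxid.
SAMPLE = "\n".join([
    "#   See the NCBI README for a description of the columns in this file.",
    "# " + "\t".join(["assembly_accession", "taxid", "organism_name",
                      "infraspecific_name", "asm_name"]),
    "\t".join(["GCF_001477545.1", "0", "Pneumocystis carinii B80",
               "strain=B80", "Pneu_cari_B80_V3"]),
]) + "\n"

for row in read_assembly_summary(io.StringIO(SAMPLE)):
    print(make_name(row), "taxid=" + row["taxid"])
```

The same row dict also exposes taxid, so the name and the taxonomy information can come from one pass over the file.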
I used https://github.com/kblin/ncbi-genome-download to download data before, but it is a bit hard to update after the first download. genome_updater seems to be geared towards this use case: https://github.com/pirovc/genome_updater
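If neither tool fits, the download URL for a genome can also be derived from the ftp_path column of assembly_summary.txt. The sketch below assumes the usual NCBI layout, where the genomic FASTA sits inside the assembly directory and is named after that directory with a `_genomic.fna.gz` suffix. The example path is constructed from the accession and assembly name mentioned above rather than copied from the file, and `genomic_fasta_url` is a made-up helper name.

```python
def genomic_fasta_url(ftp_path):
    """Return the genomic FASTA URL for an assembly directory, or None if absent."""
    if not ftp_path or ftp_path == "na":
        return None
    ftp_path = ftp_path.rstrip("/")
    basename = ftp_path.rsplit("/", 1)[-1]
    return f"{ftp_path}/{basename}_genomic.fna.gz"


# Constructed example path (assumption), following the GCF/nnn/nnn/nnn layout.
example = ("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/477/545/"
           "GCF_001477545.1_Pneu_cari_B80_V3")
print(genomic_fasta_url(example))
```

This only builds the URL; keeping a local copy up to date as assemblies are added or withdrawn is the part that genome_updater is aimed at.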
ref #4 and #7, I think providing a small database using the GTDB 25k genomes, but with NCBI names/taxonomies instead, would be quite useful to many people.
the logic is that:
I guess we'd want to make sure the names are NCBI names where possible, and we'd want to provide a lineages CSV with it.
see also sourmash-bio/sourmash#969
Current summary of database construction and release (April 2022) - sourmash-bio/sourmash#2015
Example scripts and workflows for constructing large databases - https://github.com/sourmash-bio/database-examples
Database releases - https://github.com/sourmash-bio/database-releases