sourmash-bio / databases

Build sourmash databases for genbank.

Python 77.35% Shell 22.65%

databases's Issues

Validation steps for new databases

What checks should be executed when a new database is built?

Pointers:

sourmash-bio/sourmash#849 (comment) has commands for checking for duplicated signatures.
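One possible check, sketched here in plain Python, is to group signatures by a content hash and report any hash that appears more than once. This is not the set of commands from the linked comment; the (name, md5) input pairs and the function name `find_duplicates` are assumptions made for illustration, and the pairs would have to be extracted from the signature files by whatever means is convenient.

```python
from collections import defaultdict


def find_duplicates(records):
    """Group (name, md5) pairs by md5 and return only the md5s seen more than once.

    `records` is a hypothetical iterable of (signature_name, md5) tuples.
    """
    by_md5 = defaultdict(list)
    for name, md5 in records:
        by_md5[md5].append(name)
    # Keep only hashes shared by two or more signatures.
    return {md5: names for md5, names in by_md5.items() if len(names) > 1}
```

An empty result means no two signatures in the input share a hash; a non-empty one lists the names to inspect before the database is released.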

RefSeq representative genomes

NCBI is releasing RefSeq representative genomes, similar to GTDB's dereplicated genomes.
https://ncbiinsights.ncbi.nlm.nih.gov/2020/08/21/updated-representative-genomes/

It might be worth building an equivalent sourmash database, and releasing a new version of it whenever NCBI releases a new version.

what should our signature names look like for databases?

for genbank, we do --name-from-first, so we get output like this:

CP001941.1 Aciduliprofundum boonei T4...

for gtdb, we do trickier name setting, so we get output like this:

GCF_000025665 s__Aciduliprofundum boonei

The main difference is that the GCF_ identifier refers to the whole genome, not just to the first sequence. That seems better.

We could add an optional identifier string to signatures. Hrm. Ref sourmash-bio/sourmash#268 for more such questions.
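To make the difference between the two naming styles concrete, here is a small sketch that splits a signature name into its leading identifier and the rest, then checks whether the identifier looks like an assembly accession. The helper names are hypothetical, and treating the `GCF_`/`GCA_` prefixes as "genome-level" is an assumption for illustration only.

```python
def split_signature_name(name):
    """Split a name into (identifier, description) at the first space."""
    ident, _, description = name.partition(" ")
    return ident, description


def is_assembly_accession(ident):
    """Assumption: assembly-level identifiers start with GCF_ or GCA_."""
    return ident.startswith(("GCF_", "GCA_"))
```

With the two examples above, the genbank-style name yields the sequence accession `CP001941.1`, while the gtdb-style name yields `GCF_000025665`, which identifies the whole genome.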

ref #7

indexing from file (for sourmash 4.x)

With sourmash 4.x, we can build SBTs from a file containing a list of signature paths. This seems to circumvent the long DAG solve times snakemake has with many files. It is working well for gtdb r95 (~31k genomes; see /group/ctbrowngrp/gtdb-r95/ on farm) via the code below.

# signames_to_file only writes a small text file, so run it locally
localrules: signames_to_file

# Write the path of every signature file, one per line
rule signames_to_file:
    input:  expand(os.path.join(out_dir, "signatures", "{sample}.sig"), sample=sample_names),
    output: os.path.join(out_dir, "index", "{basename}.signatures.txt")
    run:
        with open(str(output), "w") as outF:
            for inF in input:
                outF.write(str(inF) + "\n")

# Build one zipped SBT per alphabet/ksize/scaled combination from the list file
rule index_sbt:
    input: os.path.join(out_dir, "index", "{basename}.signatures.txt")
    output: os.path.join(out_dir, "index", "{basename}.{alphabet}-k{ksize}-scaled{scaled}.sbt.zip"),
    threads: 1
    params:
        # alphabet_info is defined elsewhere in the workflow
        alpha_cmd = lambda w: alphabet_info[w.alphabet]["alpha_cmd"],
        ksize = lambda w: int(w.ksize)*int(alphabet_info[w.alphabet]["ksize_multiplier"]),
    resources:
        # request more memory on each retry
        mem_mb=lambda wildcards, attempt: attempt * 50000,
        runtime=6000,
    log: os.path.join(logs_dir, "index", "{basename}.{alphabet}-k{ksize}-scaled{scaled}.index-sbt.log")
    benchmark: os.path.join(benchmarks_dir, "index", "{basename}.{alphabet}-k{ksize}-scaled{scaled}.index-sbt.benchmark")
    conda: "envs/sourmash-dev.yml"
    shell:
        """
        sourmash index {output} --ksize {params.ksize} \
        --scaled {wildcards.scaled} {params.alpha_cmd}  \
        --from-file {input}  2> {log}
        """

from: https://github.com/bluegenes/thumper/blob/master/thumper/index.snakefile
for sketching rules, see https://github.com/bluegenes/thumper/blob/master/thumper/thumper.snakefile
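For trying the same idea outside snakemake, the two steps can be sketched in plain Python: write the list file, then assemble the `sourmash index` command with the flags used in the rule above. The function names and the `alpha_flag` default are assumptions made for this sketch, not part of thumper; the command is only built here, not executed.

```python
from pathlib import Path


def write_signature_list(sig_paths, list_file):
    """Write one signature path per line, mirroring the signames_to_file rule."""
    list_file = Path(list_file)
    list_file.parent.mkdir(parents=True, exist_ok=True)
    with open(list_file, "w") as out:
        for path in sig_paths:
            out.write(f"{path}\n")
    return list_file


def index_command(list_file, sbt_out, ksize, scaled, alpha_flag="--dna"):
    """Assemble the argument list for `sourmash index ... --from-file`.

    alpha_flag stands in for {params.alpha_cmd}; "--dna" is an assumed default.
    """
    return [
        "sourmash", "index", str(sbt_out),
        "--ksize", str(ksize),
        "--scaled", str(scaled),
        alpha_flag,
        "--from-file", str(list_file),
    ]
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` in an environment where sourmash 4.x is installed.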

25k signatures

hi @ctb,

I think the "real" GTDB has about 150k genomes, and 25k genomes sounds like a dereplicated set used by the GTDB toolkit, or am I missing something here? Maybe the 25k genomes correspond to this subset by the same group around Donovan Parks?

kind regards,
Adrian

Building genbank/refseq databases from assembly_summary.txt

related to sourmash-bio/sourmash#970

Each subset of RefSeq and GenBank has an assembly_summary.txt file.
This is from fungi: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/assembly_summary.txt
All refseq subsets: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/

Benefits of using assembly_summary.txt:

We can generate good names for signatures from the columns assembly_accession, organism_name, infraspecific_name and asm_name. For example, for GCF_001477545.1, the name could be "GCF_001477545.1 Pneumocystis carinii B80 strain=B80, Pneu_cari_B80_V3".
The taxid field can be used to generate TaxInfo and save it in the zipped SBT during indexing. Because we control both the name (instead of using --name-from-first) and how it is saved in the TaxInfo, scripts for converting results, like gather_to_opal.py, can be simplified.
We can distribute one database per RefSeq/GenBank subset, so people don't need to download a gigantic one covering everything; anyone who wants all of them can simply list them all in gather or search.
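The naming scheme from the first point could be sketched as below. The sketch assumes that assembly_summary.txt is tab-separated and that its column names sit on a comment line beginning with `#`; the function names are hypothetical, and the name layout simply reproduces the GCF_001477545.1 example rather than any settled convention.

```python
def read_assembly_summary(handle):
    """Yield one dict per data row of an assembly_summary.txt-style file.

    Assumption: the header is the '#' line containing 'assembly_accession'.
    """
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#"):
            fields = line.lstrip("#").strip().split("\t")
            if "assembly_accession" in fields:
                header = fields
            continue
        if header is None or not line:
            continue
        yield dict(zip(header, line.split("\t")))


def signature_name(row):
    """Build 'accession organism infraspecific_name, asm_name' from one row."""
    parts = [row["assembly_accession"], row["organism_name"]]
    if row.get("infraspecific_name"):
        parts.append(row["infraspecific_name"])
    name = " ".join(parts)
    if row.get("asm_name"):
        name += ", " + row["asm_name"]
    return name
```

Rows with an empty infraspecific_name or asm_name simply produce a shorter name, so every signature still starts with the assembly accession.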

More info: https://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf

check genome_updater for genbank/refseq downloads

I used https://github.com/kblin/ncbi-genome-download to download data before, but it is a bit hard to update after the first download. genome_updater seems to be geared towards this use case: https://github.com/pirovc/genome_updater

build a streamlined Genbank/RefSeq based on GTDB 25k genomes

ref #4 and #7, I think providing a small database using the GTDB 25k genomes, but with NCBI names/taxonomies instead, would be quite useful to many people.

the logic is that:

NCBI taxonomy is a mess, but people like it and are used to it;
since GTDB 25k is a nice low-redundancy collection of genomes, they're good to match against;
so we could provide just those genomes, but with NCBI taxonomy instead of GTDB taxonomy.

I guess we'd want to make sure the names are NCBI names where possible, and we'd want to provide a lineages CSV with it.
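A rough sketch of the lineages CSV step, under several assumptions: the GTDB identifiers are unversioned (as in `GCF_000025665`) while NCBI accessions carry a version suffix, the NCBI lineages are already available as a dict keyed by accession, and the CSV columns are `ident` followed by seven ranks. The function names, the input shapes and the column layout are all hypothetical here and would need checking against what sourmash expects.

```python
import csv

RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def strip_version(accession):
    """Drop a trailing '.N' version so GTDB and NCBI accessions can be compared."""
    return accession.split(".")[0]


def write_lineages_csv(gtdb_idents, ncbi_lineages, out):
    """Write NCBI lineages for the genomes named in gtdb_idents.

    ncbi_lineages: hypothetical dict of NCBI accession -> list of rank names.
    Returns the sorted GTDB identifiers for which no NCBI lineage was found.
    """
    wanted = {strip_version(i) for i in gtdb_idents}
    missing = set(wanted)
    writer = csv.writer(out)
    writer.writerow(["ident"] + RANKS)
    for acc, lineage in sorted(ncbi_lineages.items()):
        base = strip_version(acc)
        if base in wanted:
            # Pad short lineages with empty strings so every row has all ranks.
            row = list(lineage) + [""] * (len(RANKS) - len(lineage))
            writer.writerow([acc] + row[:len(RANKS)])
            missing.discard(base)
    return sorted(missing)
```

Returning the unmatched identifiers makes it easy to see which of the GTDB genomes would end up without an NCBI name or lineage.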

Please see other repositories and issues!

Current summary of database construction and release (April 2022) - sourmash-bio/sourmash#2015

Example scripts and workflows for constructing large databases - https://github.com/sourmash-bio/database-examples

Database releases - https://github.com/sourmash-bio/database-releases

upgrade sourmash_databases soon (for sourmash 4.0)

ref sourmash-bio/sourmash#970
