databases's People

Contributors

ctb, luizirber

databases's Issues

what should our signature names look like for databases?

For GenBank, we use --name-from-first, so we get names like this:

CP001941.1 Aciduliprofundum boonei T4...

For GTDB, we set names in a more involved way, so we get names like this:

GCF_000025665 s__Aciduliprofundum boonei

The main difference is that the GCF_ identifier refers to the whole genome, not just the first sequence. That seems better.
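To make the contrast concrete, here is a minimal sketch of the two naming schemes. It is not sourmash code: `name_from_first` and `name_from_assembly` are hypothetical helpers, and the FASTA text is made up for illustration.

```python
def name_from_first(fasta_text):
    """Roughly what --name-from-first does: use the first FASTA header."""
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            return line[1:].strip()
    raise ValueError("no FASTA header found")


def name_from_assembly(accession, species):
    """GTDB-style name: whole-genome accession plus species label."""
    return f"{accession} {species}"


# Hypothetical two-record genome; only the first header ends up in the name.
fasta = ">SEQ1.1 Example organism chromosome\nACGT\n>SEQ2.1 Example organism plasmid\nGGCC\n"

first_name = name_from_first(fasta)
genome_name = name_from_assembly("GCF_000025665", "s__Aciduliprofundum boonei")
print(first_name)
print(genome_name)
```

The first scheme names the signature after one record, so a plasmid or second chromosome is never reflected; the second uses an identifier that covers the whole assembly.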

We could add an optional identifier string to signatures. Hrm. Ref sourmash-bio/sourmash#268 for more such questions.

ref #7

indexing from file (for sourmash 4.x)

With sourmash 4.x, we can build SBTs from a file containing a list of signatures. This seems to avoid the long DAG solve times Snakemake has with many files. It is working well for GTDB r95 (~31k genomes; see /group/ctbrowngrp/gtdb-r95/ on farm) via the code below.

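# Writing the list is cheap, so run it locally rather than as a cluster job.
localrules: signames_to_file

# Collect the paths of all signature files into one text file, one path per line.
# out_dir and sample_names are defined elsewhere in the workflow.
rule signames_to_file:
    input:  expand(os.path.join(out_dir, "signatures", "{sample}.sig"), sample=sample_names),
    output: os.path.join(out_dir, "index", "{basename}.signatures.txt")
    run:
        with open(str(output), "w") as outF:
            for inF in input:
                outF.write(str(inF) + "\n")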

rule index_sbt:
    input: os.path.join(out_dir, "index", "{basename}.signatures.txt")
    output: os.path.join(out_dir, "index", "{basename}.{alphabet}-k{ksize}-scaled{scaled}.sbt.zip"),
    threads: 1
    params:
        alpha_cmd = lambda w: alphabet_info[w.alphabet]["alpha_cmd"],
        ksize = lambda w: int(w.ksize)*int(alphabet_info[w.alphabet]["ksize_multiplier"]),
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 50000,
        runtime=6000,
    log: os.path.join(logs_dir, "index", "{basename}.{alphabet}-k{ksize}-scaled{scaled}.index-sbt.log")
    benchmark: os.path.join(benchmarks_dir, "index", "{basename}.{alphabet}-k{ksize}-scaled{scaled}.index-sbt.benchmark")
    conda: "envs/sourmash-dev.yml"
    shell:
        """
        sourmash index {output} --ksize {params.ksize} \
        --scaled {wildcards.scaled} {params.alpha_cmd}  \
        --from-file {input}  2> {log}
        """

from: https://github.com/bluegenes/thumper/blob/master/thumper/index.snakefile
for sketching rules, see https://github.com/bluegenes/thumper/blob/master/thumper/thumper.snakefile

Building genbank/refseq databases from assembly_summary.txt

related to sourmash-bio/sourmash#970

Each subset of RefSeq and GenBank has an assembly_summary.txt file.
This is from fungi: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/assembly_summary.txt
All refseq subsets: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/

Benefits of using assembly_summary.txt:

  • We can generate good names for signatures from the columns, using assembly_accession, organism_name, infraspecific_name and asm_name. For example, for GCF_001477545.1, the name could be GCF_001477545.1 Pneumocystis carinii B80 strain=B80, Pneu_cari_B80_V3
  • The taxid field can be used to generate TaxInfo and save it in the Zipped SBT during indexing. Because we control both the name (instead of using --name-from-first) and how it is saved in the TaxInfo, scripts for converting results like gather_to_opal.py can be simplified.
  • We can distribute one database per RefSeq/GenBank subset, so people don't need to download one gigantic database for everything. If they want to use all of them, that works too (just list them all in gather or search).
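As a rough illustration of the first two points, the sketch below parses an assembly_summary.txt-style table and builds a name from the four columns mentioned. The helper names are hypothetical, the sample table is trimmed to a few columns, the taxid value is a placeholder, and the split of the example name into organism_name and infraspecific_name is an assumption.

```python
import io


def read_assembly_summary(handle):
    """Yield one dict per assembly from an assembly_summary.txt-style file."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # The column names sit on a comment line beginning with '#'.
            fields = line.lstrip("#").strip().split("\t")
            if fields[0] == "assembly_accession":
                header = fields
            continue
        if header is None or not line:
            continue
        yield dict(zip(header, line.split("\t")))


def signature_name(row):
    """Build '<accession> <organism> <infraspecific>, <asm_name>'."""
    parts = [row["assembly_accession"], row["organism_name"]]
    if row.get("infraspecific_name"):
        parts.append(row["infraspecific_name"])
    name = " ".join(parts)
    if row.get("asm_name"):
        name += ", " + row["asm_name"]
    return name


# Trimmed, illustrative table; the real file has many more columns.
# The taxid "0" is a placeholder, not the real value.
sample = io.StringIO(
    "#   illustrative excerpt\n"
    "# assembly_accession\ttaxid\torganism_name\tinfraspecific_name\tasm_name\n"
    "GCF_001477545.1\t0\tPneumocystis carinii B80\tstrain=B80\tPneu_cari_B80_V3\n"
)

rows = list(read_assembly_summary(sample))
names = [signature_name(row) for row in rows]
taxids = {row["assembly_accession"]: row["taxid"] for row in rows}
print(names[0])
```

The `taxids` mapping is the piece that could feed taxonomy information into the index; how that gets stored in the zipped SBT is left open here.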

More info: https://ftp.ncbi.nlm.nih.gov/pub/factsheets/HowTo_Downloading_Genomic_Data.pdf

build a streamlined Genbank/RefSeq based on GTDB 25k genomes

Ref #4 and #7. I think providing a small database built from the GTDB 25k genomes, but with NCBI names/taxonomies instead, would be quite useful to many people.

the logic is that:

  • NCBI taxonomy is a mess, but people like it and are used to it;
  • since GTDB 25k is a nice low-redundancy collection of genomes, they're good to match against;
  • so we could provide just those genomes, but with NCBI taxonomy instead of GTDB taxonomy.

I guess we'd want to make sure the names are NCBI names where possible, and we'd want to provide a lineages CSV with it.
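A lineages CSV could be written along these lines. This is only a sketch: the `identifiers` plus rank columns layout is assumed to match what the sourmash LCA tooling reads, `write_lineages_csv` is a hypothetical helper, and the lineage values are placeholders rather than real NCBI taxonomy.

```python
import csv
import io

# Rank columns, assumed to match the sourmash lineage spreadsheet layout.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def write_lineages_csv(lineages, out):
    """Write {identifier: {rank: name}} as one CSV row per genome."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["identifiers"] + RANKS)
    for ident in sorted(lineages):
        lineage = lineages[ident]
        # Ranks missing from the NCBI lineage are left empty.
        writer.writerow([ident] + [lineage.get(rank, "") for rank in RANKS])


# Placeholder data; real entries would come from the NCBI taxonomy.
example = {
    "GCF_EXAMPLE": {
        "superkingdom": "ExampleKingdom",
        "genus": "Examplegenus",
        "species": "Examplegenus exampleus",
    }
}

buf = io.StringIO()
write_lineages_csv(example, buf)
lines = buf.getvalue().splitlines()
print(buf.getvalue())
```

Leaving missing ranks empty matters here because NCBI lineages often skip ranks that GTDB always fills in.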

see also sourmash-bio/sourmash#969
