Comments (8)

luizirber commented on July 23, 2024

I used it to create a mirrored subset of genbank and refseq (archaea, bacteria, fungi, viral, protozoa), and really liked it. I don't have time to make a pull request now, but some blueprints:

  • genome_updater.sh supports recovering files from a previously obtained assembly summary file with the -e option. This is really useful for reproducing how a database was built, and we can store the summary file somewhere (osf? zenodo?)
  • Instead of finding all files in a dir, use the assembly_summary.txt to select files (from a domain like 'archaea' or 'fungi').
  • From previous item: build separate DBs for each domain. This is something that the euk SBTs already do (see https://github.com/dib-lab/2018-euk-SBTs and https://osf.io/a46zr/)
  • Another benefit of using the assembly_summary.txt: we don't depend on acc2taxid to get the taxonomy of each dataset. This can also be a mapping generated during database construction, and saved in the zipped SBT (or whatever taxonomy format we end up adopting in sourmash: sourmash-bio/sourmash#969)
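
A minimal sketch of the assembly_summary.txt ideas above, assuming the usual NCBI layout: tab-separated rows, with the column names on a comment line that begins with `assembly_accession`. The function names (`read_assembly_summary`, `genome_filename`, `acc2taxid`) are hypothetical helpers, not part of genome_updater or sourmash, and `SAMPLE` is a trimmed illustration that keeps only four of the real columns.

```python
import io


def read_assembly_summary(handle):
    """Yield one dict per assembly from an assembly_summary.txt handle."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Column names sit on a comment line such as
            # "# assembly_accession<TAB>bioproject<TAB>...".
            fields = line.lstrip("# ").split("\t")
            if fields[0] == "assembly_accession":
                header = fields
            continue
        if not line or header is None:
            continue
        yield dict(zip(header, line.split("\t")))


def genome_filename(record):
    """Expected genomic FASTA name, built from the last part of ftp_path."""
    stem = record["ftp_path"].rstrip("/").rsplit("/", 1)[-1]
    return stem + "_genomic.fna.gz"


def acc2taxid(records):
    """Map assembly accession -> taxid without a separate acc2taxid file."""
    return {r["assembly_accession"]: int(r["taxid"]) for r in records}


# Trimmed, illustrative input: real files carry many more columns.
SAMPLE = (
    "#   See the NCBI README_assembly_summary.txt for column descriptions\n"
    "# assembly_accession\ttaxid\torganism_name\tftp_path\n"
    "GCF_000005845.2\t511145\tEscherichia coli str. K-12 substr. MG1655\t"
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/"
    "GCF_000005845.2_ASM584v2\n"
)

records = list(read_assembly_summary(io.StringIO(SAMPLE)))
wanted_files = {genome_filename(r) for r in records}
taxonomy = acc2taxid(records)
```

Reading one summary file per domain (for example the one downloaded for 'archaea') and keeping only the names in `wanted_files` would give the per-domain file selection; `taxonomy` is the kind of mapping that could be stored alongside the database.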

I think most of these items fit sourmash-bio/sourmash#970 too.

from databases.

ctb commented on July 23, 2024

cool, I'll look into using it - is there a seed on farm that I can start from, or shall I just re-download the whole shebang?

luizirber commented on July 23, 2024

Re-download; I did this on my own machine.

ctb commented on July 23, 2024

Very good experience so far --

# make group directory
cd /group/ctbrowngrp
mkdir ncbi-genomes
cd ncbi-genomes

# install genome_updater
conda create -n genupd -y genome_updater
conda activate genupd

# grab all Shewanella genomes - dry run (-k)
genome_updater.sh -g "taxids:22" -k -m -o shew

# with one thread
genome_updater.sh -g "taxids:22" -m -o shew

# with four threads, + fix directory (-i)
genome_updater.sh -g "taxids:22" -m -o shew -i -t 4

luizirber commented on July 23, 2024

Another benefit of using the assembly_summary.txt: we don't depend on acc2taxid to get the taxonomy of each dataset. This can also be a mapping generated during database construction, and saved in the zipped SBT (or whatever taxonomy format we end up adopting in sourmash: dib-lab/sourmash#969)

Also: instead of using --name-from-first, set the --name based on the assembly_summary.txt. This gives us better control over which accession number ends up in the name, and lets us use the GCF/GCA identifiers instead. More info: luizirber/2020-cami#1
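
A rough sketch of deriving that name, assuming each assembly_summary.txt row has already been parsed into a dict keyed by the NCBI column names `assembly_accession` and `organism_name`. The helper `signature_name` and the "accession, then organism" layout are illustrative assumptions, not an agreed sourmash convention.

```python
def signature_name(row, accession_only=False):
    """Build a signature name from one parsed assembly_summary.txt row.

    Hypothetical helper: the returned string is what would be handed to
    the --name option instead of relying on the first FASTA header.
    """
    accession = row["assembly_accession"]  # GCF_... or GCA_... identifier
    if accession_only:
        return accession
    return f"{accession} {row['organism_name']}"


# Illustrative row with only the two columns the helper needs.
row = {
    "assembly_accession": "GCF_000005845.2",
    "organism_name": "Escherichia coli str. K-12 substr. MG1655",
}
full_name = signature_name(row)
short_name = signature_name(row, accession_only=True)
```

Putting the accession first keeps the identifier easy to split off again later, which is one way to get the "right accession number" control mentioned above.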

ctb commented on July 23, 2024

(yep, that's how the GTDB snakemake does it)

ctb commented on July 23, 2024

note to self, put things in /group/ctbrowngrp/

ctb commented on July 23, 2024

closing in favor of our modern database build practices.
