
magsearch's Introduction

Search (very) large public databases with sourmash sketches

Run it like so:

snakemake -s magsearch.snakefile --configfile config.yml -j 32

Needs ~40 GB of RAM, 32 cores.

To run the "integration" test on farm, use:

snakemake -s magsearch.snakefile -j 2 --configfile config-test.yml

License & authorship

This repository was originally forked from https://github.com/sourmash-bio/sra_search.

This software is under the AGPL license. Please see LICENSE.txt.

Authors:

Luiz Irber, N. Tessa Pierce-Ward, C. Titus Brown

magsearch's Issues

notes: running magsearch for christy

editing here: https://hackmd.io/EQG9YLZwQGOeoKWjy-fHFg

Running MAGsearch for Christy

Christy G. asked me to run MAGsearch for her, and I thought I'd document it this time!

1. sketch the genomes

I grabbed all of her genomes and then ran:

sourmash sketch dna -p k=31,scaled=1000 *

in the directory containing the FASTA files.

I then put them in a zip file:

zip -r christy-2022.09.25.zip *.sig

and transferred them to farm (our HPC).

2. unpack the sketches and generate a list

On farm, I went to my MAGsearch directory:

cd ~ctbrown/scratch/magsearch
mkdir query.christy-2022.09.25

and unzipped the sketches into it:

unzip ~/transfer/christy-2022.09.25.zip -d query.christy-2022.09.25

and made a list of the files relative to the base MAGsearch directory:

ls -1 query.christy-2022.09.25/* > query.christy-2022.09.25.txt
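The same list can be built from Python, which makes it easy to fail early if the directory turns out to be empty. This is only a sketch: `write_query_list` is a hypothetical helper, not part of magsearch, and unlike the `ls` above it assumes the sketches all end in `.sig` and skips anything else.

```python
from pathlib import Path


def write_query_list(query_dir, list_path, base_dir="."):
    """Write one signature path per line, relative to base_dir.

    Hypothetical helper (not part of magsearch); assumes sketches end in .sig.
    """
    base = Path(base_dir)
    sigs = sorted((base / query_dir).glob("*.sig"))
    if not sigs:
        raise FileNotFoundError(f"no .sig files found in {base / query_dir}")
    # Paths are stored relative to the base MAGsearch directory.
    lines = [str(p.relative_to(base)) for p in sigs]
    (base / list_path).write_text("\n".join(lines) + "\n")
    return lines
```

Called from the base MAGsearch directory, `write_query_list("query.christy-2022.09.25", "query.christy-2022.09.25.txt")` would write the list file and return the paths it contains.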

3. make a configuration file

I made a new copy of the config file:

cp config.yml config-christy-2022.09.25.yml

and then added the search-specific things:

# unique query name
query_name: christy-2022.09.25

# list of paths of query signatures - 1 or more.
query_sigs: query.christy-2022.09.25.txt

# catalog to search - list of paths of subject signatures
#catalog: /group/ctbrowngrp/sra_search/catalogs/metagenomes
catalog: catalog.sub

# containment threshold to use
threshold: 0.01

# k-mer size to use
ksize: 31

# scaled to use
scaled: 1000

# where to put the results
out_dir: "output.magsearch"
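A quick sanity check of these settings before a long run might look like the sketch below. The key names are taken from the config above, but `check_config` and the rules it applies are assumptions made for illustration, not checks that magsearch itself performs. The config is written out as a plain dict here to keep the example free of a YAML dependency.

```python
REQUIRED_KEYS = ("query_name", "query_sigs", "catalog", "threshold",
                 "ksize", "scaled", "out_dir")


def check_config(cfg):
    """Return a list of problems found in a magsearch-style config dict.

    The checks are illustrative assumptions, not rules enforced by magsearch.
    """
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in cfg]
    threshold = cfg.get("threshold")
    if threshold is not None and not 0 <= threshold <= 1:
        problems.append("threshold should be a fraction between 0 and 1")
    for key in ("ksize", "scaled"):
        value = cfg.get(key)
        if value is not None and (not isinstance(value, int) or value <= 0):
            problems.append(f"{key} should be a positive integer")
    return problems


# Same values as the config shown above.
config = {
    "query_name": "christy-2022.09.25",
    "query_sigs": "query.christy-2022.09.25.txt",
    "catalog": "catalog.sub",
    "threshold": 0.01,
    "ksize": 31,
    "scaled": 1000,
    "out_dir": "output.magsearch",
}
print(check_config(config))  # prints []
```

Note that the ksize and scaled values here are the same ones used when sketching the genomes in step 1.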

4. start an srun session

Next I started screen and ran a beefy srun:

screen -S magsearch-christy
srun -p high2 --time=48:00:00 --nodes=1 --cpus-per-task 32 --mem 50GB --pty /bin/bash

and ran a test:

snakemake -s magsearch.snakefile --configfile config-christy-2022.09.25.yml -j 32

Note that this is only a test, because I'm searching a small catalog, catalog.sub. This makes sure the queries etc. can all be loaded before we run the real thing for a day or two!

5. check logs for test

It looks like all went well:

% cat output.magsearch/logs/sra_search.k31.log
[2022-09-25T12:56:54Z INFO  sra_search] Loading queries
[2022-09-25T12:56:54Z INFO  sra_search] Loaded 27 query signatures
[2022-09-25T12:56:54Z INFO  sra_search] Loading siglist
[2022-09-25T12:56:54Z INFO  sra_search] Loaded 14 sig paths in siglist
[2022-09-25T12:56:54Z INFO  sra_search] Processed 0 search sigs

(the last line is output only every so often, so more than 0 search sigs were processed.)
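If you would rather check these counts automatically, something like the following sketch could pull them out of the log. It assumes the message wording stays exactly as in the log excerpt above; `summarize_log` is a hypothetical helper, not part of magsearch or sra_search.

```python
import re

# Patterns copied from the log messages shown above.
LOG_PATTERNS = {
    "queries": re.compile(r"Loaded (\d+) query signatures"),
    "sig_paths": re.compile(r"Loaded (\d+) sig paths in siglist"),
    "processed": re.compile(r"Processed (\d+) search sigs"),
}


def summarize_log(text):
    """Return the last count reported for each pattern, or None if absent."""
    summary = {name: None for name in LOG_PATTERNS}
    for line in text.splitlines():
        for name, pattern in LOG_PATTERNS.items():
            match = pattern.search(line)
            if match:
                summary[name] = int(match.group(1))
    return summary


example = """\
[2022-09-25T12:56:54Z INFO  sra_search] Loading queries
[2022-09-25T12:56:54Z INFO  sra_search] Loaded 27 query signatures
[2022-09-25T12:56:54Z INFO  sra_search] Loading siglist
[2022-09-25T12:56:54Z INFO  sra_search] Loaded 14 sig paths in siglist
[2022-09-25T12:56:54Z INFO  sra_search] Processed 0 search sigs
"""
print(summarize_log(example))
# prints {'queries': 27, 'sig_paths': 14, 'processed': 0}
```

Comparing `queries` against the number of lines in the query list is a cheap way to confirm that nothing was silently dropped.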

6. run for realz

Remove test output,

rm output.magsearch/results/christy-2022.09.25.csv 

edit the config file like so:

# catalog to search - list of paths of subject signatures
catalog: /group/ctbrowngrp/sra_search/catalogs/metagenomes
#catalog: catalog.sub

and run!

snakemake -s magsearch.snakefile --configfile config-christy-2022.09.25.yml -j 32

trying out quobyte

quobyte is a new IO intensive space on farm that we're testing out.

description

We have a parallel file-system in ALPHA testing on Farm. It has been configured and works in our simple testing, but we would like you to throw your worst/heaviest IO at it. For now, I have limited each user to 50 TB and each group to 500 TB of storage. You can access it through /quobyte/hpccf-scratch/scratch/ . This is a parallel file-system, so you will see all of your files on every node; it is NOT node local.

transferring files

update.sh:

#! /bin/bash
rsync -av /group/ctbrowngrp/irber/data/wort-data/ /quobyte/hpccf-scratch/scratch/wort

du

farm:/quobyte/hpccf-scratch/scratch/wort$ du -sh *
0       manifests
37K     update.sh
459G    wort-genomes
5.9G    wort-img
13T     wort-sra
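To see how this compares with the 50 TB per-user limit mentioned in the description, the `du -sh` figures can be totalled. A rough sketch, assuming `du -h` reports sizes in powers of 1024 (the GNU default) and that the quota is counted the same way, which is a guess on my part; `parse_size` is a hypothetical helper.

```python
# du -h style suffixes, assuming powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a du -h style size such as '459G' or '5.9G' to bytes."""
    text = text.strip()
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)  # plain number, e.g. '0'


du_output = """\
0       manifests
37K     update.sh
459G    wort-genomes
5.9G    wort-img
13T     wort-sra
"""

# The first whitespace-separated field on each line is the size.
total = sum(parse_size(line.split()[0]) for line in du_output.splitlines())
quota = 50 * UNITS["T"]  # assumed interpretation of the 50 TB user limit
print(f"{total / UNITS['T']:.2f} T used, {total / quota:.0%} of the assumed quota")
```

Almost all of the total comes from wort-sra, so under these assumptions the copy sits well inside the per-user limit.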

upshot

I have my own copy of wort to play with! <evil cackle>

use cases

seeing if a sample is in the database / identifying if sample is there

differential privacy/dbgap search (at level of technical replicates)

biogeography - where might I look

discovering more examples of strains/species of an interesting species/genus

outbreak detection - plants and humans and animals / one health

spillover idea/spillover risk

"finding gut microbes" example writ larger

notification service of new matches

content-based (re)annotation of stuff in the SRA
