ctb / magsearch

Workflow and config files for searching (very) large public databases with sourmash sketches

License: GNU Affero General Public License v3.0

magsearch's Introduction

Search (very) large public databases with sourmash sketches

Run it like so:

snakemake -s magsearch.snakefile --configfile config.yml -j 32

Needs ~40 GB of RAM, 32 cores.

To run the "integration" test on farm (our HPC), use:

snakemake -s magsearch.snakefile -j 2 --configfile config-test.yml

License & authorship

This repository was originally forked from https://github.com/sourmash-bio/sra_search.

This software is under the AGPL license. Please see LICENSE.txt.

Authors:

Luiz Irber, N. Tessa Pierce-Ward, C. Titus Brown

magsearch's Issues

notes: running magsearch for christy

editing here: https://hackmd.io/EQG9YLZwQGOeoKWjy-fHFg

Running MAGsearch for Christy

Christy G. asked me to run MAGsearch for her, and I thought I'd document it this time!

1. sketch the genomes

I grabbed all of her genomes and then ran:

sourmash sketch dna -p k=31,scaled=1000 *

in the directory containing the FASTA files.

I then put them in a zip file:

zip -r christy-2022.09.25.zip *.sig

and transferred them to farm (our HPC).

2. unpack the sketches and generate a list

On farm, I went to my MAGsearch directory:

cd ~ctbrown/scratch/magsearch
mkdir query.christy-2022.09.25

and unzipped the sketches:

unzip ~/transfer/christy-2022.09.25.zip -d query.christy-2022.09.25

and made a list of the files relative to the base MAGsearch directory:

ls -1 query.christy-2022.09.25/* > query.christy-2022.09.25.txt
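
The same list can also be built from Python, which makes it easy to fail early if the query directory turns out to be empty. This is a minimal sketch, not part of magsearch: the helper name `write_query_list` is hypothetical, and it assumes the sketches end in `.sig`, as in the zip step above.

```python
from pathlib import Path


def write_query_list(query_dir, list_path, base_dir="."):
    """Write one signature path per line, relative to base_dir.

    Hypothetical helper (not part of magsearch); assumes sketches end in .sig.
    """
    base = Path(base_dir)
    sigs = sorted((base / query_dir).glob("*.sig"))
    if not sigs:
        # Stop here rather than hand the workflow an empty query list.
        raise FileNotFoundError(f"no .sig files found in {base / query_dir}")
    lines = [p.relative_to(base).as_posix() for p in sigs]
    (base / list_path).write_text("\n".join(lines) + "\n")
    return lines
```

From the base MAGsearch directory, this would be called as `write_query_list("query.christy-2022.09.25", "query.christy-2022.09.25.txt")`.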

3. make a configuration file

I made a new copy of the config file:

cp config.yml config-christy-2022.09.25.yml

and then added the search-specific things:

# unique query name
query_name: christy-2022.09.25

# list of paths of query signatures - 1 or more.
query_sigs: query.christy-2022.09.25.txt

# catalog to search - list of paths of subject signatures
#catalog: /group/ctbrowngrp/sra_search/catalogs/metagenomes
catalog: catalog.sub

# containment threshold to use
threshold: 0.01

# k-mer size to use
ksize: 31

# scaled to use
scaled: 1000

# where to put the results
out_dir: "output.magsearch"
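
Since a full run takes a day or two, it can be worth sanity-checking the config before launching anything. The sketch below is an assumption-laden illustration: `check_config` is a hypothetical helper, the key names come from the config above, and the checks themselves are my guesses at sensible values rather than rules the workflow enforces.

```python
from pathlib import Path

# Keys taken from the config shown above.
REQUIRED_KEYS = ("query_name", "query_sigs", "catalog", "threshold",
                 "ksize", "scaled", "out_dir")


def check_config(cfg, base_dir="."):
    """Return a list of problems found in a magsearch-style config dict.

    Hypothetical helper; the checks are assumptions, not rules enforced
    by the workflow itself.
    """
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in cfg]
    if problems:
        return problems
    if not 0 <= float(cfg["threshold"]) <= 1:
        problems.append("threshold should be a containment fraction in [0, 1]")
    for key in ("ksize", "scaled"):
        if int(cfg[key]) <= 0:
            problems.append(f"{key} must be a positive integer")
    # Both entries are paths, resolved relative to the base directory.
    for key in ("query_sigs", "catalog"):
        if not (Path(base_dir) / cfg[key]).exists():
            problems.append(f"{key} path does not exist: {cfg[key]}")
    return problems
```

The dict could come from loading the YAML file with PyYAML's `yaml.safe_load`; an empty return value means none of these checks failed.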

4. start an srun session

Next I started screen and ran a beefy srun:

screen -S magsearch-christy
srun -p high2 --time=48:00:00 --nodes=1 --cpus-per-task 32 --mem 50GB --pty /bin/bash

and ran a test:

snakemake -s magsearch.snakefile --configfile config-christy-2022.09.25.yml -j 32

Note that this is only a test: it searches a small catalog, catalog.sub, to make sure the queries etc. can all be loaded before we run the full search for a day or two!

5. check logs for test

It looks like all went well:

% cat output.magsearch/logs/sra_search.k31.log
[2022-09-25T12:56:54Z INFO  sra_search] Loading queries
[2022-09-25T12:56:54Z INFO  sra_search] Loaded 27 query signatures
[2022-09-25T12:56:54Z INFO  sra_search] Loading siglist
[2022-09-25T12:56:54Z INFO  sra_search] Loaded 14 sig paths in siglist
[2022-09-25T12:56:54Z INFO  sra_search] Processed 0 search sigs

(The "Processed" line is only printed every so often, so "Processed 0 search sigs" does not mean that no search sigs were processed.)
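
The check can be made less eyeball-dependent by pulling the counts out of the log and comparing them with the query list. A minimal sketch, assuming the log lines look exactly like the ones shown above (other versions of sra_search may word them differently); `summarize_log` is a hypothetical name.

```python
import re

# Patterns match the log lines shown above; other versions may differ.
QUERIES_RE = re.compile(r"Loaded (\d+) query signatures")
SIGLIST_RE = re.compile(r"Loaded (\d+) sig paths in siglist")


def summarize_log(text):
    """Extract the number of queries and subject paths reported in a log."""
    q = QUERIES_RE.search(text)
    s = SIGLIST_RE.search(text)
    return {
        "queries": int(q.group(1)) if q else None,
        "subjects": int(s.group(1)) if s else None,
    }
```

If `queries` does not equal the number of lines in the query list file, some sketches probably failed to load.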

6. run for realz

Remove test output,

rm output.magsearch/results/christy-2022.09.25.csv

edit the config file like so:

# catalog to search - list of paths of subject signatures
catalog: /group/ctbrowngrp/sra_search/catalogs/metagenomes
#catalog: catalog.sub

and run!

snakemake -s magsearch.snakefile --configfile config-christy-2022.09.25.yml -j 32

updating runinfo CSV

search at https://www.ncbi.nlm.nih.gov/sra/ for "METAGENOMIC"[Source] NOT amplicon[All Fields]

direct link courtesy of luiz:

https://www.ncbi.nlm.nih.gov/sra/?term=%22METAGENOMIC%22%5BSource%5D%20NOT%20amplicon%5BAll%20Fields%5D

to download file:

send to... summary.

consider updating file https://osf.io/download/762mk/ referenced over at https://github.com/sourmash-bio/2022-search-sra-with-mastiff/blob/main/interpret-sra-live.ipynb

trying out quobyte

quobyte is a new space on farm for IO-intensive work that we're testing out.

description

We have a parallel file-system in ALPHA testing on Farm. It has been configured and works in our simple testing, but we would like you to throw your worst/heaviest IO at it. For now, I have limited each user to 50 TB and each group to 500 TB of storage. You can access it through /quobyte/hpccf-scratch/scratch/. This is a parallel file-system, so you will see all of your files on every node; it is NOT node local.

transferring files

update.sh:

#! /bin/bash
rsync -av /group/ctbrowngrp/irber/data/wort-data/ /quobyte/hpccf-scratch/scratch/wort

du

farm:/quobyte/hpccf-scratch/scratch/wort$ du -sh *
0       manifests
37K     update.sh
459G    wort-genomes
5.9G    wort-img
13T     wort-sra

upshot

I have my own copy of wort to play with! <evil cackle>

use cases

seeing if a sample is in the database / identifying if sample is there

differential privacy/dbgap search (at level of technical replicates)

biogeography - where might I look

discovering more examples of strains/species of an interesting species/genus

outbreak detection - plants and humans and animals / one health

spillover idea/spillover risk

"finding gut microbes" example writ larger

notification service of new matches

content-based (re)annotation of stuff in the SRA

notes: searching for melainabacteria

https://hackmd.io/hX91VlwdShG8YC59vBx06A?view

syrah / 400k data set

associated with: http://ivory.idyll.org/blog/2017-sourmash-sra-microbial-wgs.html

see https://github.com/dib-lab/soursigs/blob/3db6162579e0efbc4fe8f181a7f08c3aee391d97/Snakefile#L149

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=(("biomol dna"[Properties] NOT amplicon[All Fields])) AND "bacteria"[orgn:__txid2] NOT metagenome

updating wort catalog for magsearch

per @bluegenes:

Updating takes two steps: 1) download the full list, and 2) use that full list to update the catalog info for magsearch.

download here: ~/2022-magsearch-tr/download_catalog.sh

randomize subject data sets to even out 'cost' of big data sets

Benchmarking for the paper showed that big data sets consume the most time (because of how long they take to load) as well as the most memory. This is unsurprising, and it might be easy to address by randomizing the input list.
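
One way the randomization might look, as a sketch only: it assumes the catalog is a plain text file with one signature path per line (as the config comment "list of paths of subject signatures" suggests), and `shuffle_catalog` is a hypothetical helper. Whether shuffling actually evens out the load has not been tested here.

```python
import random


def shuffle_catalog(in_path, out_path, seed=42):
    """Write the lines of a catalog list to out_path in shuffled order."""
    with open(in_path) as fh:
        # Skip blank lines so they do not become empty catalog entries.
        paths = [line.rstrip("\n") for line in fh if line.strip()]
    # A fixed seed keeps the shuffled order reproducible between runs.
    random.Random(seed).shuffle(paths)
    with open(out_path, "w") as fh:
        fh.write("\n".join(paths) + "\n")
    return paths
```

The shuffled file could then be used as the `catalog:` entry in the config, leaving the original list untouched.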