arcadia-science / sourmashconsumr

Working with the outputs of sourmash in R

Home Page: https://arcadia-science.github.io/sourmashconsumr/

License: Other

R 25.72% Standard ML 73.57% Shell 0.71%

sourmash

sourmashconsumr's Introduction

sourmashconsumr

The goal of sourmashconsumr is to parse, analyze, and visualize the outputs of the sourmash Python package. The sourmashconsumr package is still under active development.

Installation

You can install the development version of sourmashconsumr from GitHub with:

# install.packages("remotes")
remotes::install_github("Arcadia-Science/sourmashconsumr")

Eventually, we hope to release sourmashconsumr on CRAN and to provide a conda-forge package. We’ll update these instructions once we’ve done that.

Usage

See the vignette for full instructions on how to run the sourmashconsumr package (coming soon!).

To access the functions in the sourmashconsumr package, you can load it with:

library(sourmashconsumr)

The sourmashconsumr package contains a variety of functions to work with the outputs of the sourmash Python package. The table below summarizes which sourmash outputs the sourmashconsumr package operates on and the functions that are available. For a complete list of functions in the sourmashconsumr package, see the documentation.

Developer documentation

The sourmashconsumr package follows package developer conventions laid out in https://r-pkgs.org/, and changes can be contributed to the code base using pull requests. For more information on how to contribute, see the developer documentation.

Citation

If you use sourmashconsumr in your work, please cite DOI: 10.57844/arcadia-1896-ke33.
If you use sourmash in your work, please cite DOI: 10.21105/joss.00027.

If you’d like more information on how sourmash works, please see the following publications:

For a general background on how sourmash works and examples of how to use it: Large-scale sequence comparisons with sourmash
For a mathematical description of FracMinHash and a demonstration of the accuracy of sourmash gather: Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers

sourmashconsumr's People

bluegenes ctb

sourmashconsumr's Issues

refactor `tax_glom_taxonomy_annotate()` to only have piped code block occur once by auto-inheriting `glom_var`

In #37, I changed tax_glom_taxonomy_annotate() so that the user can select a glom_var. Right now you can only choose n_unique_kmers or f_unique_to_query. If I continue to expand the possible glom_vars, I'll refactor the code so that it uses the glom_var smartly and only has the piped code block once. Implementing this in #37 seemed like too much of a lift for something that might not even be that useful.
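
A minimal sketch of what a single code block could look like if glom_var is passed as a string and used to index the data frame. It uses base R only; glom_sum() and the toy data frame are hypothetical and are not the package's implementation.

```r
# Hypothetical helper (not part of sourmashconsumr): sum any numeric column
# named by `glom_var` within the groups defined by `group_cols`.
glom_sum <- function(df, group_cols, glom_var) {
  stopifnot(length(glom_var) == 1, glom_var %in% names(df))
  stopifnot(all(group_cols %in% names(df)))
  # df[glom_var] keeps the column name, so the output column is named glom_var
  stats::aggregate(df[glom_var], by = df[group_cols], FUN = sum)
}

# Toy stand-in for a taxonomy annotate data frame
toy <- data.frame(
  query_name        = c("s1", "s1", "s1", "s2"),
  genus             = c("Bacteroides", "Bacteroides", "Prevotella", "Bacteroides"),
  n_unique_kmers    = c(10, 5, 7, 3),
  f_unique_to_query = c(0.10, 0.05, 0.07, 0.03)
)

# The same function body serves both variables
glom_kmers <- glom_sum(toy, c("query_name", "genus"), "n_unique_kmers")
glom_frac  <- glom_sum(toy, c("query_name", "genus"), "f_unique_to_query")
```

Because the variable name is only data here, adding a new glom_var would not require another copy of the pipeline.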

for `read_gather` and `read_taxonomy_annotate`, figure out how to set column type by column name

The current character string feels pretty brittle. Also, I think the gather output changes as versions of sourmash change, so I'd like to have a more robust read function in place.
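
One possible approach, sketched in base R: read every column as character, then convert only the columns that are recognized by name. The function read_csv_typed_by_name(), the lookup col_converters, and the column names in the toy file are assumptions for illustration, not the package's code.

```r
# Hypothetical name -> converter lookup; the column names are examples only.
col_converters <- list(
  intersect_bp      = as.numeric,
  f_unique_to_query = as.numeric,
  query_name        = as.character
)

# Read every column as character, then convert only the columns we recognize.
# Unknown columns stay character; lookup entries absent from the file are skipped.
read_csv_typed_by_name <- function(path, converters) {
  df <- utils::read.csv(path, colClasses = "character", check.names = FALSE)
  for (col in intersect(names(converters), names(df))) {
    df[[col]] <- converters[[col]](df[[col]])
  }
  df
}

# Toy file: column order differs from the lookup and one column is unknown
tmp <- tempfile(fileext = ".csv")
writeLines(c("query_name,new_column,intersect_bp", "s1,abc,1000"), tmp)
typed <- read_csv_typed_by_name(tmp, col_converters)
str(typed)
```

With this design, a sourmash release that adds, drops, or reorders columns would not break the read step; new columns would simply arrive as character until they are added to the lookup.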

change `plot_taxonomy_annotate_ts_alluvial()` to plot `f_unique_weighted` by default

Will need to address #38 first.

for `plot_taxonomy_annotate_ts_alluvial()`, add a `show_tax` argument that allows the user to control which taxa are given alluvial ribbons

motivated by a suggestion by @elizabethmcd in #37 and inspired by show_tax in ampvis2 https://kasperskytte.github.io/ampvis2/articles/ampvis2.html

How the function works right now is it uses a fraction_threshold (by default, 0.01, or 1%) -- if a lineage is present in any of the time series at 1% or greater, it gets an alluvial ribbon in the plot. The user can change the fraction_threshold to anything they want it to be. Anything that does not get an alluvial ribbon gets automatically clumped into "other" via a process implemented in the function.

I like the idea of show_tax. This would allow users to either provide a list of taxa to show_tax or use fraction_threshold.
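
A rough sketch of how the two options could coexist, in base R. label_ribbons() and the columns lineage and fraction are hypothetical; the real function's internals may differ.

```r
# Hypothetical helper: decide which lineages get their own ribbon.
# `show_tax`, when given, takes precedence over `fraction_threshold`.
label_ribbons <- function(df, fraction_threshold = 0.01, show_tax = NULL) {
  lineage <- as.character(df$lineage)
  if (is.null(show_tax)) {
    # keep a lineage if it reaches the threshold at any time point
    max_fraction <- tapply(df$fraction, lineage, max)
    show_tax <- names(max_fraction)[max_fraction >= fraction_threshold]
  }
  # everything else is clumped into "other"
  df$ribbon <- ifelse(lineage %in% show_tax, lineage, "other")
  df
}

toy <- data.frame(
  time     = c(1, 2, 1, 2, 1, 2),
  lineage  = c("A", "A", "B", "B", "C", "C"),
  fraction = c(0.50, 0.40, 0.005, 0.02, 0.001, 0.002)
)

by_threshold <- label_ribbons(toy)                  # A and B kept, C becomes "other"
by_list      <- label_ribbons(toy, show_tax = "C")  # only C kept
```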

Functions for importing output of `sourmash taxonomy annotate` into metacoder object

metacoder visualization
- read_sourmash_taxonomy_annotate(file, intersect_bp_threshold)
- pivot_sourmash_taxonomy_annotate_wider()
- sourmash_taxonomy_annotate_to_metacoder(sourmash_taxonomy_annotate_df, database = c("genbank", "gtdb"), summary_level = c(NULL, "genus", ...))
  - database will control class_regex for parse_tax_data()
  - summary_level will control if the sourmash results are agglomerated up the taxonomic lineage during the creation of the metacoder object (e.g. to genus level).
  - sequence of functions:
    - read_sourmash_taxonomy_annotate() to purrr::map_dfr
    - pivot_sourmash_taxonomy_annotate_wider()
    - parse_tax_data()
    - calc_taxon_abund()
    - calc_n_samples()
  - goal is to do everything that a user would need to think of doing to get the data into metacoder land to enable visualization. Check and see if there is something that needs to be done for diff abund viz/matrix viz.

document rules for naming functions

Naming functions

Functions that are exported (e.g. user-facing) are named by the action completed by the function, the sourmash output type they act on, and if relevant, a description of the action taken.

Action words:
- read
- plot
- from
sourmash output types:
- signature
- compare_csv
- gather
- taxonomy_annotate
example actions:
- to_metacoder
- upset
- heatmap
- mds

Functions that are not exported do not follow a naming scheme but strive to be fully descriptive of their actions, and when possible use the sourmash output types to make it clear what type of data the internal function operates on.

examples of internal functions
- check_compare_df_sample_col_and_move_to_rowname()
- check_and_edit_names_in_signatures_df()
- check_uniform_parameters_in_signatures_df()
- make_agglom_cols()
- make_expression()
- get_scaled_for_max_hash()
- pivot_wider_taxonomy_annotate()

add advice to use the `--name` flag in `sketch` to vignette

to get max use out of the sourmashconsumr package

remove themes from ggplots, or at least make sure the same themes are used throughout

to make sure there is a consistent user experience.

plot_compare_mds I think uses theme_classic, while plot_signatures_rarefaction doesn't have a theme.

I think not having a theme is probably the right way to go? Except that alluvial plots and sankey plots are more rewarding with a blank background, so maybe theme_classic is a good default.

enable taxonomy plotting with LIN taxonomic framework?

In sourmash taxonomy, we're adding utils to use the LIN taxonomic framework, which allows for greater flexibility and specificity compared with standard taxonomic ranks. For example, if only certain strains of a microbe are pathogenic, the LIN framework may be useful for identifying/grouping pathogenic vs non-pathogenic strains.

Is this something you're interested in allowing for viz? Though LINs aren't super widely used yet, I think they have neat potential for sourmash applications.

LIN concept example (ref https://doi.org/10.1093/nar/gkaa190):

add color to the compare plots

Right now, the compare plot looks like this:

comp <- read_compare_csv("tests/testthat/comp_k31.csv")
mds <- make_compare_mds(compare_df = comp)
plt <- plot_compare_mds(mds)
plt

It might be nice to have this plot accept colors optionally:

I went to implement this, but it wasn't clear to me what the best way to do this would be. I decided to leave this as-is for now, and then as I use the functions, I think it will become clear how I interact with this and then I'll add it to the function.

Similarly, it would be cool to color the axis labels or something by sample type or group for the heatmap:

Again, don't know how to do this in a way that will be intuitive to downstream users yet, so will do later!

code I used to figure out how to make the sankey plot

no promises that it runs, but recording here so it's somewhere

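library(ggplot2)    # needed for the unqualified aes(), theme_classic(), labs(), ... calls below
library(ggalluvial)
library(ggforce)    # provides gather_set_data() and the geom_parallel_sets*() layers
library(magrittr)
library(sourmashconsumr)

taxonomy_annotate_df <- read_taxonomy_annotate(Sys.glob("tests/testthat/SRR19*lineage*.csv"), separate_lineage = TRUE) %>%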
  dplyr::select(f_unique_to_query, f_unique_weighted, domain, phylum, class, family, order, genus, species) %>%
  dplyr::group_by(domain, phylum, class, family, order, genus, species) %>%
  dplyr::summarize(sum_f_unique_weighted = sum(f_unique_weighted))

ggalluvial::is_alluvia_form(taxonomy_annotate_df)


ggplot2::ggplot(taxonomy_annotate_df,
       ggplot2::aes(y = sum_f_unique_weighted, axis1 = domain, axis2 = phylum, axis3 = class, axis4 = order, axis5 = family)) +
  #ggalluvial::geom_alluvium(aes(fill = order), width = 1/12) +
  ggalluvial::geom_flow() +
  ggalluvial::geom_stratum(width = 1/10, alpha = .5, aes(fill = c(family))) +
  ggplot2::geom_text(stat = "stratum", aes(label = after_stat(stratum)),
                     size = 2, hjust = -0.25) +
  theme_classic() +
  labs(x = "taxonomic rank", y = "abundance-weighted unique fraction\ntotaled across all samples") +
  scale_x_continuous(labels = c("domain", "phylum", "class", "order", "family"),
                     breaks = c(1, 2, 3, 4, 5))


ggplot(taxonomy_annotate_df,
       aes(x = survey, stratum = response, alluvium = subject,
           y = freq,
           fill = response, label = response)) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow() +
  geom_stratum(alpha = .5) +
  geom_text(stat = "stratum", size = 3) +
  theme(legend.position = "none") +
  ggtitle("vaccination survey responses at three points in time")


taxonomy_annotate_df <- read_taxonomy_annotate(Sys.glob("tests/testthat/*lineage*.csv"), separate_lineage = TRUE) %>%
  dplyr::select(query_name, f_unique_to_query, f_unique_weighted, domain, phylum, class, family, order, genus, species) %>%
  dplyr::group_by(query_name, domain, phylum, class, family, order, genus, species) %>%
  dplyr::summarize(sum_f_unique_weighted = sum(f_unique_weighted))
# to_lodes_form() is called further down, once taxonomy_annotate_df_long exists

# create a fill variable --> it will be based on alphabetical order (which is how the alluvial plot is ordered)
# and it will be for each level of taxonomy
# probably needs to switch to long format
taxonomy_annotate_df_long <- taxonomy_annotate_df %>%
  tidyr::pivot_longer(cols = domain:species, names_to = "taxonomic_rank", values_to = "taxonomic_label")

taxonomy_annotate_df_long <- transform(taxonomy_annotate_df_long, taxonomic_label = factor(taxonomic_label))
to_lodes_form(taxonomy_annotate_df_long)
ggplot(taxonomy_annotate_df_long,
       aes(x = taxonomic_rank, stratum = taxonomic_label, alluvium = query_name,
           y = sum_f_unique_weighted,
           fill = taxonomic_label, label = taxonomic_label)) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow() +
  geom_stratum(alpha = .5) +
  geom_text(stat = "stratum", size = 3) +
  theme(legend.position = "none") +
  ggtitle("alluvial plot")

# test data ---------------------------------------------------------------

data(vaccinations)
vaccinations <- transform(vaccinations,
                          response = factor(response, rev(levels(response))))
ggplot(vaccinations,
       aes(x = survey, stratum = response, alluvium = subject,
           y = freq,
           fill = response, label = response)) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow() +
  geom_stratum(alpha = .5) +
  geom_text(stat = "stratum", size = 3) +
  theme(legend.position = "none") +
  ggtitle("vaccination survey responses at three points in time")


# try parallel sets -------------------------------------------------------

data <- reshape2::melt(Titanic)
data <- gather_set_data(data, 1:4)
data

data <- gather_set_data(taxonomy_annotate_df, 1:7)
palette <- colorRampPalette(RColorBrewer::brewer.pal(8, "Set2"))(length(unique(data$y)))
ggplot(data, aes(x, id = id, split = y, value = sum_f_unique_weighted)) +
  geom_parallel_sets(alpha = 0.3, axis.width = 0.1) +
  geom_parallel_sets_axes(axis.width = 0.2, aes(fill = y)) +
  geom_parallel_sets_labels(colour = 'black', angle = 360, size = 2, hjust = -0.25) +
  theme_classic() +
  theme(axis.line.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.ticks.x = element_blank(),
        legend.position = "None") +
  labs(x = "taxonomic rank") +
  scale_x_continuous(labels = c("domain", "phylum", "class", "order", "family", "genus", "species", ""),
                     breaks = c(1, 2, 3, 4, 5, 6, 7, 8),
                     limits = c(.75, 8)) +
  scale_fill_manual(values = palette)

Change some variable names in R/metacoder.R

summary_level to agglomeration_level or taxglom_level: summary level isn't clear. it should be made more clear this is for agglomeration.
switch taxonomy_annotate_tibble to taxonomy_annotate_df:
naming it a tibble is sort of annoying, plus a tibble is technically still a data frame. Changing this would make it match with how signature data frames are referred to (signatures_df). Plus df is shorter than tibble, which is nice.
change taxonomy_annotate_to_metacoder() to from_taxonomy_annotate_to_metacoder()

documenting `plot_taxonomy_annotate_ts_alluvial()` and output

library(dplyr)           # for %>% and select()
library(sourmashconsumr)

taxonomy_annotate_df <- read_taxonomy_annotate(Sys.glob("~/github/2022-prjna853785-sourmash/outputs/sourmash_taxonomy/SRR*lineages*csv"))

tmp <- readr::read_csv("https://raw.githubusercontent.com/Arcadia-Science/2022-prjna853785-sourmash/main/inputs/metadata.csv") %>%
  select(query_name = run_accession, time = age_months)

plot_taxonomy_annotate_ts_alluvial(taxonomy_annotate_df, time_df = tmp, tax_glom_level = "genus")

converting tax annotate files to phyloseq object

I have been using sourmashconsumr to convert taxonomy annotate files to phyloseq objects, but I keep getting the error:

Error in validObject(.Object) :
invalid class “sample_data” object: Sample Data must have non-zero dimensions.

I have ensured there are no 0 within the data frames, and the sample name in the dataframe is correct, but I still have the error. Any advice as to what could be causing this error?

Code used below for reference-

#read in CSV
taxonomy_annotate_df <- read_csv("sample1.51gtdb.with-lineages.csv")
head(taxonomy_annotate_df)

#metadata- new dataframe from existing data
query_name <- c("sample1.fq")
metadata <- data.frame(query_name = query_name)

#Replace Zero with NA Value in a dataframe
taxonomy_annotate_df [taxonomy_annotate_df == 0] <- NA

#Converting from taxonomy annotate to phyloseq object
sample1_phyloseq <- from_taxonomy_annotate_to_phyloseq(taxonomy_annotate_df = taxonomy_annotate_df,
metadata_df = metadata %>%
tibble::column_to_rownames("query_name"))
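
One possible cause, offered as a guess rather than a confirmed diagnosis: the metadata data frame above has query_name as its only column, so moving that column to the row names leaves a table with zero columns, which would fit the "non-zero dimensions" message. The base R sketch below only illustrates the dimensions; to_rownames() is a hypothetical stand-in for tibble::column_to_rownames(), and the sample_type column is made up.

```r
# The metadata from the report has a single column, query_name
metadata <- data.frame(query_name = c("sample1.fq"))

# Hypothetical base R stand-in for tibble::column_to_rownames()
to_rownames <- function(df, col) {
  rownames(df) <- df[[col]]
  df[[col]] <- NULL
  df
}

dims_before <- dim(to_rownames(metadata, "query_name"))  # 1 row, 0 columns

# With at least one descriptive column (made up here), the table is
# still non-empty after the move
metadata$sample_type <- "example"
dims_after <- dim(to_rownames(metadata, "query_name"))   # 1 row, 1 column
```

If this is the cause, adding any real metadata column before the conversion may be enough to get past the error.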

make "other" the last level in ts alluvial plot legend
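
A small base R sketch of one way to do this, by setting factor levels explicitly so that "other" comes last. other_last() is a hypothetical helper; the legend order in ggplot2 follows factor level order for discrete scales.

```r
# Hypothetical helper: order levels alphabetically, with "other" placed last
other_last <- function(x, other = "other") {
  x <- as.character(x)
  lvls <- sort(unique(x))
  factor(x, levels = c(setdiff(lvls, other), intersect(lvls, other)))
}

ribbons <- c("Prevotella", "other", "Bacteroides", "other", "Zymomonas")
ribbon_levels <- levels(other_last(ribbons))
ribbon_levels
# "Bacteroides" "Prevotella" "Zymomonas" "other"
```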

update vignette so it reflects the new palette changing abilities for the taxonomy and gather upset plots

add functions for importing output of `sourmash taxonomy annotate` to phyloseq object

https://taylorreiter.github.io/2022-07-28-From-raw-metagenome-reads-to-phyloseq-taxonomy-table-using-sourmash-gather-and-sourmash-taxonomy/

sourmash-bio/sourmash#2289

phyloseq integration
- sourmash_taxonomy_annotate_to_phyloseq()
  - read_sourmash_taxonomy_annotate()
  - sourmash_taxonomy_annotate_to_tax_table()
  - sourmash_taxonomy_annotate_to_count_table()

for gather upset plot, add another barchart on top that shows the fraction of sample that was taken up by the intersection

This could be calculated by summing over f_unique_weighted for the things in the intersection. Then I could color by the query sample.

It might be super annoying to implement this, so posting as an issue and leaving as-is for now.

Add visualizations for specific use cases like time series or different groups

So far I've been focused on visualizations that will work no matter if samples are highly related, time series, different groups with lots of replicates, large or small sample sizes, etc. I think now that some of these base visualizations are encoded, I can do some more specific things as they come up.

Brainstorming below!

Time series

from: The temporal dynamics of the tracheal microbiome in tracheostomised patients with and without lower respiratory infections. August 2017, PLoS ONE 12(8):e0182520. DOI: 10.1371/journal.pone.0182520

Differential abundance

Show in a vignette how to go from from_sourmash_taxonomy_to_metacoder to the differential heat tree viz (viz from the metacoder vignette)

Visualization when we have a tree

When GTDB is the database, we have a tree we can use to build visualizations (although we would have to have a function to download it, and that might get annoying):
from: https://www.nature.com/articles/s41579-021-00562-3

switch `tax_glom*`/agglomeration language to aggregate

@elizabethmcd pointed out in #23:

This might just be a personal thing, but the term agglomeration and referring to the function tax_glom_taxonomy_annotate seems a little confusing and maybe doesn't clearly convey what this function is doing. I think in R the similar but more known action is aggregating and people might be more familiar with this? Up to you.

I was copying the syntax/naming of the phyloseq function that does this: https://rdrr.io/bioc/phyloseq/man/tax_glom.html

I wanted to record this feedback because if we keep getting it then I want to change the wording for the function.

installation fails on R 4.2.2 on Linux/i386, installed via conda.

with R installed via the following conda environment spec,

name: env
channels:
    - conda-forge
    - bioconda
    - defaults
dependencies:
    - python>=3.8
    - snakemake-minimal>=7.19.1,<8
    - sourmash>=4.6,<5
    - curl
    - r-ggplot2
    - r-pheatmap
    - r-viridis
    - r-ggplotify
    - r-rmarkdown

(the resulting set of installed packages is in the attached mamba list output, mamba-list.txt), running the remotes::install_github() command in the README results in:

...
* checking for file ‘/tmp/Rtmp32aOFf/remotes716552f5280e/Arcadia-Science-sourmashconsumr-9ceaa18/DESCRIPTION’ ... OK
* preparing ‘sourmashconsumr’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building ‘sourmashconsumr_0.1.0.tar.gz’
ERROR: dependencies ‘httr’, ‘metacoder’, ‘phyloseq’ are not available for package ‘sourmashconsumr’

Not sure there's anything you can do about this, but wanted to document it here ;).

package name: change to something that makes it clear that this package doesn't encode the sourmash functionality in R, but consumes the outputs of sourmash and does stuff to them

Following convention, I'd like to avoid putting punctuation in the name.

Names that don't make it clear that this package does not re-implement the core sourmash functionality

rourmash
sourmashR
souRmash (also this one is bad because it's basically the same as sourmash)

Names that are a catchall

sourmashRutils

Maybe better ideas

sourmashconsumR

make the default theme for the upper half of all of the upset plots `theme_classic()`

documentation here: https://krassowski.github.io/complex-upset/articles/Examples_R.html#substituting-themes

upset(movies, genres, min_size=10, themes=list(default=theme()))

add functions to visualize and interrogate overall taxonomy results

Like those used in the notebook here: https://github.com/Arcadia-Science/2022-prjna853785-sourmash/blob/main/notebooks/20220815-visualize-sourmash-taxonomy-results.ipynb

Visualizations that I think are worth including:

- fraction of sample matched/unclassified colored by database
  - maybe add a low-confidence portion: taxonomic matches that had less than 50kb in the entire sample. I could use the paired palette for this (high confidence bacteria, low confidence bacteria, etc.).
- upset plot of shared lineages
  - would be nice to choose which level of taxonomy this plot is made at
- ability to dig into intersections from the upset plots

make a vignette per sourmash output type

signatures (output by sourmash sketch or sourmash compute):
- read_signature(), show how to read multiple signatures using purrr,
- upset plots: from_signatures_to_upset_df(), plot_signatures_upset()
- rarefaction plots for signatures sketched from reads: from_signatures_to_rarefaction_df(), plot_signatures_rarefaction()
sourmash compare csv:
- read_compare()
- MDS plot: make_compare_mds(), plot_compare_mds()
- heatmap: plot_compare_heatmap()
sourmash taxonomy annotate csv
- read_taxonomy_annotate()
- taxonomy agglomeration: tax_glom_taxonomy_annotate()
- upset plot: from_taxonomy_annotate_to_upset_inputs(), plot_taxonomy_annotate_upset()
- sankey plot: plot_taxonomy_annotate_sankey()
- time series alluvial plot: plot_taxonomy_annotate_ts_alluvial()
- to metacoder: from_taxonomy_annotate_to_metacoder()
- to phyloseq: from_taxonomy_annotate_to_phyloseq()
sourmash gather csv
- read_gather()
- barchart: plot_gather_classified()
- upset plot: from_gather_to_upset_df(), plot_gather_upset()
upset utilities
- from_list_to_upset_df()
- from_upset_df_to_intersection_members()
- from_upset_df_to_intersection_summary()
- from_upset_df_to_intersections()

order sankey plot by most frequently occurring level instead of alphabetical?
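
A base R sketch of frequency-based ordering, assuming the plot order follows factor level order. order_by_frequency() is a hypothetical helper, and "frequency" is taken here to mean the number of rows a label appears in; ordering by summed abundance instead would mean replacing table() with a per-label sum.

```r
# Hypothetical helper: order factor levels from most to least frequent label
order_by_frequency <- function(x) {
  x <- as.character(x)
  counts <- sort(table(x), decreasing = TRUE)
  factor(x, levels = names(counts))
}

genera <- c("Prevotella", "Bacteroides", "Prevotella",
            "Alistipes", "Prevotella", "Bacteroides")
genus_levels <- levels(order_by_frequency(genera))
genus_levels
# "Prevotella" "Bacteroides" "Alistipes"
```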

make the palette for the taxonomy annotate upset plot a user-specifiable parameter

use scale_fill_manual(values =) to specify instead. then user will need to specify a vector of e.g. hex codes or colors to control the color values

`n_unique_kmers` doesn't exist

Had an error shared with me (🎉):

Error in `dplyr::select()`:
! Can't subset columns that don't exist
x Column `n_unique_kmers` doesn't exist.
Run `rlang::last_error()` to see where the error occurred.

I see that the n_unique_kmers column is added during read_taxonomy_annotate, so the error is likely caused by using read_csv rather than read_taxonomy_annotate to read the file.

Would it be worth changing this internal column to n_unique_weighted_found to avoid this error for sourmash v4.5+, since we have the column now? We figured this name more clearly described the column info, but I'm not sure we discussed outside of the sourmash PR that added it.

Or if you want to force folks to use read_taxonomy_annotate (I see you do a couple other things in there) is there a way to catch the error + suggest the solution?
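
A minimal sketch of the "catch the error and suggest the solution" option, in base R. check_required_columns() is a hypothetical helper and the wording of the message is only a suggestion.

```r
# Hypothetical helper: fail early with a hint when expected columns are absent
check_required_columns <- function(df, required) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop(
      "Missing column(s): ", paste(missing_cols, collapse = ", "),
      ". Was the file read with read_taxonomy_annotate() rather than read_csv()?",
      call. = FALSE
    )
  }
  invisible(df)
}

# Toy data frame standing in for a file read without read_taxonomy_annotate()
raw <- data.frame(query_name = "s1", f_unique_to_query = 0.1)
msg <- tryCatch(
  check_required_columns(raw, c("query_name", "n_unique_kmers")),
  error = function(e) conditionMessage(e)
)
msg
```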

thanks for the awesome software!

example of making rarefaction curves from signatures representing fastq files

remotes::install_github("Arcadia-Science/sourmashconsumr")
library(sourmashconsumr)
library(dplyr)
library(ggplot2)
library(purrr)

sigs <- Sys.glob("*100k.sig") %>%
  map_dfr(read_signature) %>%
  filter(ksize == 21)

rarefaction_df <- from_signatures_to_rarefaction_df(sigs)
plot_signatures_rarefaction(rarefaction_df) # +
  # theme_minimal() +
  # geom_point(aes(color = name))

uncomment lines to get colored curves and no grey background.

functions to read the outputs of sourmash

read_taxonomy_annotate
read_gather
read_compare_csv
read_signature_describe
read_signature_csv
- wait to implement until there is a standardized function in sourmash sourmash-bio/sourmash#1098

refactor `plot_taxonomy_annotate_ts_alluvial()` to allow more flexible column names for `time_df`.

In #37, I implemented a function plot_taxonomy_annotate_ts_alluvial() that produces an alluvial flow plot for time series metagenomes. It takes as input a time_df, which I made so that it has to have the column names query_name and time. It would probably be good to generalize this...I'll think about doing that if it becomes annoying that it isn't generalized.

for `read*()` functions, check if file exists and output an error if it doesn't

Otherwise, when a non-existent file is supplied:

> Sys.glob("~/github/2022-strains/SRR/SRR*lineage*csv")
character(0)

You get this entirely unhelpful error message:

Error in tidyr::separate(., .data$lineage, into = c("domain", "phylum", : 
object 'taxonomy_annotate_df' not found
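
A sketch of a check that could run at the top of each read function, in base R. check_files_exist() is a hypothetical helper; it treats an empty vector (what Sys.glob() returns when nothing matches) separately from paths that do not exist.

```r
# Hypothetical helper: validate file paths before trying to read them
check_files_exist <- function(file) {
  if (length(file) == 0) {
    stop("No file paths were supplied. If you used Sys.glob(), the pattern matched no files.",
         call. = FALSE)
  }
  missing_files <- file[!file.exists(file)]
  if (length(missing_files) > 0) {
    stop("File(s) not found: ", paste(missing_files, collapse = ", "), call. = FALSE)
  }
  invisible(file)
}

# A glob that matches nothing returns character(0)
empty_glob <- Sys.glob(file.path(tempdir(), "no-such-prefix-*lineage*csv"))
empty_msg <- tryCatch(check_files_exist(empty_glob),
                      error = function(e) conditionMessage(e))

# A path that does not exist is reported by name
absent <- file.path(tempdir(), "definitely-absent-file.csv")
absent_msg <- tryCatch(check_files_exist(absent),
                       error = function(e) conditionMessage(e))
```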

add functions for rarefaction for groups of signatures using vegan

like used here (note all of these links are to the specaccum branch which will be deleted after Arcadia-Science/2022-mtx-not-in-mgx-pairs#9 is merged):

Only makes sense to run signatures with abundances calculated from reads. Also only really makes sense when it's run on many signatures from the same sample.

requires signatures to be read into a data frame (see #4)

switch some packages that are Imported (e.g. must be installed) to Suggests (can install and load the library if suggested packages aren't installed)

metacoder: only used for one function at the moment
phyloseq: will only be needed to read taxonomy output to taxonomy table
vegan: will only be used for rarefaction curves
ComplexUpset/UpSetR: will only be used for upset plots

I think I would like if only tidyverse packages are imports, and everything else is a suggest. TBD though :)

allow users to choose to name from filename or name from query name when plotting things that have multiple samples

because some people won't use the --name flag in sketch and it will be empty
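
A small base R sketch of a fallback, assuming the data frame carries both a name and a filename for each sample. label_samples() is hypothetical.

```r
# Hypothetical helper: use `name` where present, otherwise the file's base name
label_samples <- function(name, filename) {
  name <- as.character(name)
  empty <- is.na(name) | name == ""
  name[empty] <- basename(as.character(filename))[empty]
  name
}

labels <- label_samples(
  name     = c("sampleA", "", NA),
  filename = c("reads/a.fq.gz", "reads/b.fq.gz", "reads/c.fq.gz")
)
labels
# "sampleA" "b.fq.gz" "c.fq.gz"
```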

change `read_signature()` to read in from one or many files

Both read_gather() and read_taxonomy_annotate() automatically determine whether a user provided one file path or many file paths, and then read all of the files into a single data frame. read_signature() currently doesn't do that...it only works on one file. But it's simple to make it read many using purrr::map_dfr(read_signature)...so I should implement that so that the user experience for the functions is consistent.
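
The one-or-many pattern can be sketched in base R as below. read_many() is a hypothetical wrapper, and small CSV files stand in for signature files so the example runs without sourmashconsumr.

```r
# Hypothetical wrapper: apply a single-file reader to one or many paths
# and row-bind the results into one data frame
read_many <- function(file, read_one) {
  stopifnot(is.character(file), length(file) >= 1)
  do.call(rbind, lapply(file, read_one))
}

# Toy files standing in for signatures
dir <- tempfile()
dir.create(dir)
paths <- file.path(dir, c("a.csv", "b.csv"))
write.csv(data.frame(name = "a", ksize = 21), paths[1], row.names = FALSE)
write.csv(data.frame(name = "b", ksize = 31), paths[2], row.names = FALSE)

one  <- read_many(paths[1], utils::read.csv)  # a single path works
many <- read_many(paths, utils::read.csv)     # so does a vector of paths
```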

add `tidyr::drop_na()` to sankey plot function to avoid errors/warnings and inaccurate plots

I had three data points with NAs in a recent sankey plot and I got the following errors and warnings and weird-looking plots. This could be fixed with a drop_na filtering step. Could be parameterized, or just documented so the user knows this is happening.

Warning messages:
1: Removed 3 rows containing non-finite values (`stat_parallel_sets()`). 
2: Removed 3 rows containing non-finite values (`stat_parallel_sets_axes()`). 
3: Computation failed in `stat_parallel_sets_axes()`
Caused by error in `compute_panel()`:
! Axis aesthetics must be constant in each split 
4: Removed 3 rows containing non-finite values (`stat_parallel_sets_axes()`). 
5: Computation failed in `stat_parallel_sets_axes()`
Caused by error in `compute_panel()`:
! Axis aesthetics must be constant in each split
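
A base R sketch of the filtering step, using stats::complete.cases() in place of tidyr::drop_na() so it runs without extra packages. drop_incomplete_rows() and the toy data are hypothetical; the message keeps the user informed that rows were removed.

```r
# Hypothetical helper: drop rows with NA in the lineage columns and say so
drop_incomplete_rows <- function(df, cols) {
  keep <- stats::complete.cases(df[, cols, drop = FALSE])
  n_dropped <- sum(!keep)
  if (n_dropped > 0) {
    message("Dropping ", n_dropped, " row(s) with NA in: ",
            paste(cols, collapse = ", "))
  }
  df[keep, , drop = FALSE]
}

toy <- data.frame(
  domain            = c("Bacteria", "Bacteria", NA),
  phylum            = c("Bacteroidota", NA, NA),
  f_unique_weighted = c(0.2, 0.1, 0.05)
)

clean <- drop_incomplete_rows(toy, c("domain", "phylum"))
```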

Alpha diversity estimation

Hello and thanks for the awesome tool.

I have a question. I see you introduced an efficient method to plot and represent beta diversity between samples (dissimilarities).

I was wondering what the best way to represent alpha diversity is. Is it just the number of taxa detected by sourmash taxonomy, the total number of sketches, or the slope like in the tutorial?
What is the most correct way to represent the richness of a community? I think people would still love to see the total number of species detected. But maybe a rarefaction curve with k-mers should be reported too, to support the result?

Thanks, and sorry for the question; I am still a noob in metagenomics.

add a new visualization for sourmash taxonomy annotate that's like the `plot_gather_classified()` plot

but allow agglomeration up levels of taxonomy

Set up CI and document development environment and how to develop

Brought up in #8

Coloring strain plots and why I decided not to implement it for now

Over in #50, I implemented a function that works with the sourmash taxonomy annotate output to detect whether multiple strains of a given species are present in a metagenome sample. I toyed with the idea of trying to count the number of strains likely present, mostly by clustering the abundances of the matched genomes. I would then color each matched genome by the strain that I guessed it belonged to (strain1, strain2, strain3, etc.).

I've decided to punt on this for now because I don't think the gather/taxonomy outputs have enough information to do this well. While different strains may sometimes cluster by abundance, I think it's likely that the first genome match will scoop in k-mers from multiple strains, and because we mostly report average k-mer abundance in the gather output, deconvolving these abundances is basically impossible.

I think the right thing to do here would be to take a genome-grist-esque approach where, for a given set of genome matches within a species, we download all of them and iteratively map k-mers or reads to those genomes. Then we could use an expectation maximization (EM) algorithm to assign k-mers/reads to a genome. Alternatively, we could align to everything at once and then still use an EM algorithm that takes advantage of all the read mapping info to do the assignment.

This would be a big lift for relatively little payoff: the perk of this sourmash approach is that it's fast, and the idea is that you could use it to detect strain variation and then use heavier tools to dig in. Doing a big mapping and then EM would be a big separate endeavor.

Dumping some code i ripped out of the function that dealt with abundances and trying to guess how many strains were present

abundances

  # ABUNDANCE -- I don't actually know what to do here, so to start,
  # I'm just coding to flag species where average kmer abundances for genomes deviate by more than 2.
  # average_abund <- taxonomy_annotate_df %>%
  #   dplyr::filter(.data$species %in% more_than_one_genome_observed_for_species$species) %>% # filter to species with more than one genome observed
  #   dplyr::group_by(query_name, species) %>%
  #   dplyr::summarise(min_average_abund = min(average_abund),
  #                    max_average_abund = max(average_abund),
  #                    sd_average_abund = sd(average_abund)) %>%
  #   dplyr::mutate(range_average_abund = max_average_abund - min_average_abund)

  #average_abund_filtered <- average_abund %>%
  #  dplyr::filter(range_average_abund >= 10)

guessing strains present

  # the below code won't work, but I think this logic could be used to draw delineations to count the number of strains.
  # I'm not totally sure yet how to get this logic to work with facet_wrap() to show colors --
  # probably something like calculating it in a different data frame and then joining it to the df that's plotted.
  # I would probably make a column like "strain" where I would label each dot "strain1", "strain2", etc. based on which intervals in the sd that the abundance falls in.
  # seq(average_abund$min_average_abund, average_abund$max_average_abund, by = average_abund$sd_average_abund)

  # I think I could also use logic to label potential prophages -- something like less than 3% of the genome with >100 more abundant than any other match for that species.
  # I need to validate this first though, potentially using SRR492184 Enterococcus faecalis.
  # Genome-grist on this sample would probably be the easiest thing to do.

  # plot_df <- taxonomy_annotate_df %>%
  #   dplyr::mutate(query_name_species = paste0(.data$query_name, "-", .data$species)) %>%
  #   dplyr::filter(.data$query_name_species %in% f_match_filtered$query_name_species)
  #
  # # label with strain count guesses before plotting
  # for(query_name_species in unique(plot_df$query_name_species)){
  #   print(query_name_species)
  # }

Add to docs that gzip'd sigs can be read in by `read_signature()`

add a sankey diagram as an overview of taxonomy

Example: https://github.com/fbreitwieser/pavian/blob/cd2f2173f6ad86c49e3af6dcc2407e96874de674/R/sample-build_sankey_network.R#L58

Alternative to a metacoder plot