lebeerlab / tidytacos


Functions to manipulate and visualize microbial community data

Home Page: https://lebeerlab.github.io/tidytacos/

License: GNU General Public License v3.0

R 93.36% TeX 6.64%

microbial-communities microbiome-analysis r tidy visualization

tidytacos's Introduction

tidytacos

Overview

Tidytacos (tidy TAxonomic COmpositionS) is an R package for the exploration of microbial community data. Such community data consists of read counts generated by amplicon sequencing (e.g. a region of the 16S rRNA gene) or metagenome (shotgun) sequencing. Each read count represents a number of sequencing reads identified for some taxon (an ASV, OTU, species, or higher-level taxon) in a sample.

Tidytacos builds on the tidyverse created by Hadley Wickham: the data are stored in tidy tables where each row is an observation and each column a variable. In addition, the package supplies a set of "verbs": functions that take a tidytacos object as first argument and also return a tidytacos object. This makes it easy to construct "pipe chains" of code that represent series of operations performed on the tidytacos object.
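The "verb" idea can be sketched in base R. This is only an illustration, not the package's implementation: the toy object and the functions keep_location() and add_totals() are hypothetical, and the native pipe (R 4.1 or later) stands in for the magrittr pipe used elsewhere on this page.

```r
# Toy stand-in for a tidytacos-like object: a list of three tidy tables.
# All names here are illustrative, not the package's internals.
toy <- list(
  samples = data.frame(sample_id = c("s1", "s2", "s3"),
                       location  = c("N", "NF", "N"),
                       stringsAsFactors = FALSE),
  taxa    = data.frame(taxon_id = c("t1", "t2"), stringsAsFactors = FALSE),
  counts  = data.frame(sample_id = c("s1", "s1", "s2", "s3"),
                       taxon_id  = c("t1", "t2", "t1", "t2"),
                       count     = c(10, 5, 7, 3),
                       stringsAsFactors = FALSE)
)

# A "verb": takes the object as first argument, returns an object of the same shape.
keep_location <- function(ta, loc) {
  ta$samples <- ta$samples[ta$samples$location == loc, , drop = FALSE]
  ta$counts  <- ta$counts[ta$counts$sample_id %in% ta$samples$sample_id, , drop = FALSE]
  ta
}

# Another verb: adds the total read count per sample to the sample table.
add_totals <- function(ta) {
  totals <- tapply(ta$counts$count, ta$counts$sample_id, sum)
  ta$samples$total_count <- as.numeric(totals[ta$samples$sample_id])
  ta
}

# Because each verb returns the object, the calls chain naturally.
result <- toy |> keep_location("N") |> add_totals()
result$samples
```

Since every step returns the whole object, the sample, taxon and count tables stay in sync throughout the chain.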

Prerequisites

Tidytacos is an R package. You can find instructions to download and install R here.

Tidytacos relies on the tidyverse R package (or, more accurately, set of R packages). You can install the tidyverse by running the following R code:

install.packages("tidyverse")

Finally, RStudio is a nice IDE to work with R code (as well as code in other scripting languages). It has a lot more features than what the default R IDE allows: beyond creating and saving scripts, it also shows your figures, allows you to navigate files, allows you to inspect tables etc. You can download RStudio here.

Installation

Run the following R code to install the latest version of tidytacos:

install.packages("devtools")
devtools::install_github("LebeerLab/tidytacos")

Documentation

A documentation page (help page) is available for all functions in the browser or in R. You can view it in R by running e.g. ?filter_samples. Some useful tutorials can be found on the wiki.


tidytacos's Issues

tacoplot_alphas: samples with missing alpha diversity values removed only at plotting stage

tacoplot_ord notifies the user and drops empty samples up front, maybe tacoplot_alphas could do something similar?

When I applied tacoplot_alphas directly on a tidytacos object it was quite mysterious to have samples dropped from the plot.

Or maybe a warning in add_alpha is also a good idea, saying something like: N samples were empty, returning NA for those samples.

library(tidytacos)
urt_s <- urt %>% 
  filter_samples(method == "S") %>% 
  add_alphas(methods = "shannon") 


urt_s %>% tacoplot_alphas(group_by = location)
#> Warning: Removed 2 rows containing non-finite outside the scale range
#> (`stat_ydensity()`).
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).

sum(is.na(urt_s$samples$shannon))
#> [1] 2

Created on 2024-07-26 with reprex v2.1.1
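The suggested warning could look roughly like the base-R sketch below. It is an illustration only: shannon_with_warning() and its list-of-vectors input are hypothetical and do not reflect how add_alphas works internally.

```r
# counts_by_sample: named list of numeric count vectors, one per sample (toy input).
shannon_with_warning <- function(counts_by_sample) {
  shannon <- vapply(counts_by_sample, function(x) {
    total <- sum(x)
    if (total == 0) return(NA_real_)  # empty sample: diversity is undefined
    p <- x[x > 0] / total             # relative abundances of the taxa present
    -sum(p * log(p))
  }, numeric(1))

  n_empty <- sum(is.na(shannon))
  if (n_empty > 0) {
    warning(n_empty, " sample(s) were empty, returning NA for those samples.")
  }
  shannon
}

toy_counts <- list(s1 = c(10, 10), s2 = c(0, 0), s3 = c(5, 0))

# Capture the warning text to show what the user would see.
msg <- tryCatch(shannon_with_warning(toy_counts),
                warning = function(w) conditionMessage(w))

# The values themselves, with the warning silenced for this demonstration.
alphas <- suppressWarnings(shannon_with_warning(toy_counts))
```

Warning at the point where the NA values are created tells the user why samples disappear later, instead of leaving it to the plotting layer.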

docs need instructions for users to import their own raw data

The quick start guide directs users to the source code to learn about more options for importing data.

More options to import and convert your data can be found here.

I strongly discourage this. Novice users should not be expected to read the source code to learn how to use the package. This information should be distilled into a vignette for users to read.

(Originally posted by @kelly-sovacool in #50 (comment))

[BUG] minor warnings during R CMD build

Describe the bug

R CMD build emits warnings during installation:

── R CMD build ────────────────────────────────────────────────────────────────────────────────────────────
✔  checking for file ‘/private/var/folders/yb/f86k3qcj10nfyx57x30_dgnnsds0t8/T/RtmpTemN5J/remotes178f71dc179bf/LebeerLab-tidytacos-61b451d/DESCRIPTION’ ...
─  preparing ‘tidytacos’:
─  checking DESCRIPTION meta-information ...Warning in person1(given = given[[i]], family = family[[i]], middle = middle[[i]],  :
     It is recommended to use ‘given’ instead of ‘middle’.
    OK
   Warning in person1(given = given[[i]], family = family[[i]], middle = middle[[i]],  :
     It is recommended to use ‘given’ instead of ‘middle’.
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘tidytacos_1.0.0.tar.gz’
   Warning: invalid uid value replaced by that for user 'nobody'
   Warning: invalid gid value replaced by that for user 'nobody'

Installation still proceeds successfully, so this is not a blocking problem.

To Reproduce

in a fresh environment:

devtools::install_github("LebeerLab/tidytacos")

or with a local clone of the repo:

devtools::install("path/to/tidytacos")

Expected behavior

No warnings

Version information:

OS: macOS 14.5
R version: 4.3.1
tidytacos version: 1.0.0 (cloned from master branch)

[BUG] read_tidytacos error handling broken

Hi tidytacos team, I'm testing out the package as part of the JOSS review and will raise issues for any bugs I spot on the way, such as this little one below.

cheers
David

A typo in the read_tidytacos code means a cryptic error appears if the count table is not found, instead of the message you intended.

library(tidytacos)
urt %>% write_tidytacos("temp_dir")
unlink("temp_dir/counts.csv")
read_tidytacos("temp_dir")
#> Error in paste("File", counts, ", containing count data not found in", : cannot coerce type 'closure' to vector of type 'character'

Created on 2024-07-25 with reprex v2.1.1
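The "cannot coerce type 'closure'" message is what R reports when paste() is handed a function instead of a string. A minimal sketch of the intended check follows, assuming the counts file is named counts.csv as in the reprex above. read_counts_checked() is a hypothetical helper, not the package's code.

```r
# Hypothetical helper: fail early with a readable message when the counts file is missing.
read_counts_checked <- function(din) {
  counts_file <- file.path(din, "counts.csv")  # a string, so paste()/stop() can format it
  if (!file.exists(counts_file)) {
    stop("Count data file not found: ", counts_file, call. = FALSE)
  }
  read.csv(counts_file)
}

# Demonstrate on an empty temporary directory.
empty_dir <- tempfile("taco_")
dir.create(empty_dir)
msg <- tryCatch(read_counts_checked(empty_dir),
                error = function(e) conditionMessage(e))
```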

tacoplot_stack shows only one sample

When I use tacoplot_stack(ta), it shows the bar of only one sample called "NA", possibly all samples together?
tacoplot_stack(ta, x = sample) gives the following error:

Error in `geom_bar()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in `lvls_reorder()`:
! `idx` must contain one integer for each level of `f`
Backtrace:

base (local) <fn>(x)
forcats::fct_reorder(sample_name, as.integer(sample_clustered))
forcats::lvls_reorder(f, order(summary, decreasing = .desc))
Error in geom_bar(stat = "identity") :
ℹ Error occurred in the 1st layer.
Caused by error in lvls_reorder():
! idx must contain one integer for each level of `f`

The plotly version also doesn't work.

community guidelines missing from README

JOSS requirement:

There should be clear guidelines for third-parties wishing to:

Contribute to the software

Report issues or problems with the software

Seek support

https://joss.readthedocs.io/en/latest/review_criteria.html#community-guidelines

missing unit tests for some functions

I recommend having at least some unit tests for every function, and especially so for user-facing functions. It is not necessary to have 100% code coverage, but having minimal tests for user-facing functions helps safeguard against accidentally introducing API-breaking changes.

From looking at the codecov reports, it appears several functions lack any unit tests. These include:

from_dada
from_phyloseq
create_biom_header
to_biom
add_jervis_bardy
get_ord_stat
tacoplot_zoom
tacoplot_venn
tacoplot_venn_ly
tacoplot_euler
tacoplot_prevalences

[BUG] create_tidytacos example problems

create_tidytacos help page example doesn't create a usable tidytacos object, and doesn't give any guidance on how to continue.

library(tidytacos)

# example taken from create_tidytacos help page
x <- matrix(
  c(1500, 1300, 280, 356),
  ncol = 2
)
rownames(x) <- c("taxon1", "taxon2")
colnames(x) <- c("sample1", "sample2")

# Convert to tidytacos object
data <- create_tidytacos(x, taxa_are_columns = FALSE)

# tacoplot_stack has unstated requirements about the taxonomy table format?
data %>% tacoplot_stack()
#> Error in `mutate()`:
#> ℹ In argument: `best_classification = purrr::pmap_chr(...)`.
#> Caused by error in `ta$taxa[, rank_names]`:
#> ! Can't subset columns that don't exist.
#> ✖ Columns `kingdom`, `phylum`, `class`, `order`, `family`, etc. don't exist.

# not sure what is happening here, maybe tacoplot_ord doesn't like having only 2 samples?
data %>% tacoplot_ord(x = "sample_id")
#> Error in stats::cmdscale(dist_matrix, k = dims, eig = T, list = T, ...): 'k' must be in {1, 2, ..  n - 1}

# lastly, not sure what this bit is
tidytacos("a")
#> Error in tidytacos("a"): could not find function "tidytacos"

Created on 2024-07-26 with reprex v2.1.1

recommend adding example data files for documentation

The quickstart explains how to load files into tidytacos:

taco <- read_tidytacos("/path/to/my_data")

but then proceeds to use the urt dataset for the rest of the tutorial, which is already a tidytacos object.

It would be helpful to have an example file in your package in inst/extdata and use it in the tutorial, so users will have a better understanding of the expected file format.

Additionally, it would be best practice to have code in data-raw to show how urt and leaf were created. https://r-pkgs.org/data.html#sec-data-data-raw

Add unifrac distance metric

Add a way to calculate unifrac distances

Determine a tree of the asvs using phangorn, eg:
https://cran.r-project.org/web/packages/phangorn/vignettes/Trees.html
Get Unifrac distance using said phylo tree
https://www.rdocumentation.org/packages/phyloseq/versions/1.16.2/topics/UniFrac

[BUG] R CMD Check failing on local clone of master branch

Describe the bug

R CMD Check is failing on a local clone of the master branch.

══ Documenting ════════════════════════════════════════════════════════════════════════════════════════════
ℹ Updating tidytacos documentation
ℹ Loading tidytacos
Error in loadNamespace(x) : there is no package called ‘SpiecEasi’

To Reproduce

Clone the repo, run devtools::install() followed by devtools::check() in a fresh environment.

Expected behavior

Check completing with no errors, warnings, or notes.

Additional context

SpiecEasi is not in your DESCRIPTION file. I notice your GitHub Actions workflow installs SpiecEasi along with other packages manually. Any of these packages that are actually used by the package should be added to the DESCRIPTION as Imports or Suggests. I recommend removing them from the extra-packages part of your setup so you'll have a more accurate check workflow.

In case you need it, here's a guide on specifying how to install packages from github, bioconductor, etc in your DESCRIPTION file: https://cran.r-project.org/web/packages/devtools/vignettes/dependencies.html
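If a package such as SpiecEasi ends up under Suggests rather than Imports, functions that need it usually check for it at call time. The sketch below shows that common pattern. require_suggested() is a hypothetical helper name, and the package name is a parameter so the example runs without SpiecEasi installed.

```r
# Guard for an optional dependency listed under Suggests.
# `pkg` is a parameter so the sketch can be tried with any package name.
require_suggested <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is required for this function. ",
         "Please install it first.", call. = FALSE)
  }
  invisible(TRUE)
}

require_suggested("stats")  # ships with R, so this passes silently

# A missing package produces a clear, early error instead of a loadNamespace() failure.
msg <- tryCatch(require_suggested("notARealPackage12345"),
                error = function(e) conditionMessage(e))
```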

[BUG] tacoplot_stack x argument doesn't accept string

tacoplot_stack x argument requests a string in the help doc, but won't accept one

library(tidytacos)
urt %>% tacoplot_stack(x = "participant")
#> Warning in tacoplot_stack(., x = "participant"): Sample labels not unique,
#> samples are aggregated.
#> Error in `geom_bar()`:
#> ! Problem while computing aesthetics.
#> ℹ Error occurred in the 1st layer.
#> Caused by error in `forcats::fct_reorder()`:
#> ! length(f) == length(.x) is not TRUE

# As a side note: "Label not in sample table" seems like it should be an immediate error rather than a warning
urt %>% tacoplot_stack(x = participants)
#> Warning in tacoplot_stack(., x = participants): Label 'participants' not found
#> in the samples table.
#> Error in `pull()`:
#> Caused by error:
#> ! object 'participants' not found

Created on 2024-07-25 with reprex v2.1.1
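The side note about failing immediately could be handled by validating the label before any plotting code runs. This is a minimal base-R sketch under the assumption that the label is passed as a string. check_sample_label() and the toy table are hypothetical.

```r
# Hypothetical validator: stop early when `label` is not a column of `samples`.
check_sample_label <- function(samples, label) {
  if (!is.character(label) || length(label) != 1) {
    stop("`label` must be a single string.", call. = FALSE)
  }
  if (!label %in% names(samples)) {
    stop("Label '", label, "' not found in the samples table. Available columns: ",
         paste(names(samples), collapse = ", "), call. = FALSE)
  }
  invisible(label)
}

toy_samples <- data.frame(sample_id = c("s1", "s2"),
                          participant = c("p1", "p2"))

ok  <- check_sample_label(toy_samples, "participant")   # passes
msg <- tryCatch(check_sample_label(toy_samples, "participants"),
                error = function(e) conditionMessage(e))
```

Listing the available columns in the message makes a typo such as "participants" easy to spot.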

several function arguments not documented - and rcmdcheck warnings ignored

add_metadata has a metadata_tibble argument but this is documented as metadata

This should give you a warning when you run rcmdcheck, so I'm not sure how your CI is passing...

Edit: I checked your GitHub Actions setup and see you've configured it to only fail on errors. You have a lot of undocumented argument warnings: https://github.com/LebeerLab/tidytacos/actions/runs/10097946221/job/27923934246#step:6:358

I can understand not wanting to fail on notes, as your use of tidy evaluation gives a lot of rogue notes, but ignoring these warnings isn't a good idea, e.g. here you miss documentation problems.

No examples for the filter/mutate/select functions

None of the functions in handlers.R have examples on their help pages.

https://github.com/LebeerLab/tidytacos/blob/master/R/handlers.R

You might like to combine a few of the related help pages together, e.g. all the filtering functions.

But I definitely think you need to give at least one working example for each function, or otherwise direct users who look in the help where to find examples elsewhere.

[BUG] tacoplot_ord silently ignores colouring by sample or sample_id

If this is intended behaviour I would expect it to warn me it will not colour each separate sample.

library(tidytacos)

urt %>% tacoplot_ord(x = sample)
#> Warning in tacoplot_ord(., x = sample): Empty samples detected, removing them
#> from the analysis

urt %>% tacoplot_ord(x = sample_id)
#> Warning in tacoplot_ord(., x = sample_id): Empty samples detected, removing
#> them from the analysis

Created on 2024-07-26 with reprex v2.1.1

docs style: referring to function names

As a general rule, I recommend referring to specific functions as if they were proper names, followed by parentheses and enclosed in backticks. This helps make documentation more concise and easy to read.

For example, this sentence:

The add_total_count function will add total sample read numbers to the sample table.

Could be revised as:

add_total_count() will add total sample read numbers to the sample table.

As an added bonus, pkgdown will be able to turn that function reference into a hyperlink to its doc.

This is a soft recommendation.

[BUG] Sample clustering fails when a "sample_name" column exists

When a column called "sample_name" is present in the sample table, plotting a stack plot fails (it produces a single stack called "NA" instead of one per sample).

Code that produces the error (tidytacos v1.0.0, R v4.3.2):

library(tidyverse)
library(tidytacos)

# read data
data <- read_rds("tacoplot_bug.rds")

# this fails (only one stack called "NA")
data %>% tacoplot_stack() 

# this fixes the problem! 
data$samples$sample_name <- NULL
data %>% tacoplot_stack()

I'm pretty sure it has to do with one of the counts_matrix functions (or a similar one) using the "sample_name" column as rownames for the count matrix instead of the "sample_id" column.

The dataset: tacoplot_bug.zip

[BUG] version in DESCRIPTION is out of sync

Two problems:

The version in the DESCRIPTION file from the tagged release v1.0.1 states the version is 1.0.0. Looks like you forgot to bump the version before cutting the release.
Currently, the DESCRIPTION in the master branch states the version is 1.0.0. Since it is ahead of the most recent release, it should ideally be of the form 1.0.1.9XXX to mark it as a development version. I recommend doing this immediately after cutting a release. See R pkgs guide on post-release
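The ordering this relies on can be checked in base R with package_version(). The suffix 9000 below is just one instance of the 9XXX pattern mentioned above.

```r
release <- package_version("1.0.1")       # the tagged release
dev     <- package_version("1.0.1.9000")  # development version after that release

# The development version sorts after the release it follows.
dev > release
```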

Visualizing shared ASVs between conditions

I think a way to determine which ASVs are shared for certain conditions would be a nice addition to the package. I would find a table with overlaps between groups useful (proportion of ASVs that are shared) similar to the betas table. The ggVennDiagram package might also be useful for visualization (e.g. tacoplot_venn(ta, condition=location)), where the input needs to be a list of ASVs per condition.

Additionally, it would be nice to add an additional variable for subsetting samples. For example for the URT dataset: the proportion of shared ASVs between nose and nasopharynx per participant, then the additional argument for the function would be: "shared_within=participant" or something.
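The requested table could be computed along the lines of this base-R sketch. Everything here is an assumption for illustration: the toy data, the shared_proportion() helper, and the choice of shared divided by union as the "proportion shared".

```r
# Toy presence data: one row per (sample, taxon) with a non-zero count.
toy <- data.frame(
  participant = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"),
  location    = c("N",  "N",  "NF", "NF", "N",  "NF", "NF"),
  taxon_id    = c("t1", "t2", "t2", "t3", "t1", "t1", "t4"),
  stringsAsFactors = FALSE
)

# Proportion of taxa shared between two levels of a condition (shared / union).
shared_proportion <- function(df, condition, a, b) {
  taxa_a <- unique(df$taxon_id[df[[condition]] == a])
  taxa_b <- unique(df$taxon_id[df[[condition]] == b])
  length(intersect(taxa_a, taxa_b)) / length(union(taxa_a, taxa_b))
}

# Overall, then within each participant (the "shared_within" idea).
overall <- shared_proportion(toy, "location", "N", "NF")
within  <- sapply(split(toy, toy$participant), shared_proportion,
                  condition = "location", a = "N", b = "NF")
```

Here overall is a single number and within is a named vector with one value per participant. A Venn diagram would use the per-condition taxon vectors directly.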

[BUG] Aggregate after trim ASVs

ASVs that become identical sequences after trimming need to be aggregated.

Add extra alpha div metrics

Is your feature request related to a problem? Please describe.
The existing add_alphas adds observed and inverse_simpson alpha diversity metrics. It would be nice to allow other types too.

Describe the solution you'd like
An optional metric argument(?) to the add_alphas function, to allow picking of other metrics.

expanding on taxonlist_per_condition

Hi,

I like the concept of the taxonlist_per_condition function, but it doesn't seem to work for either the URT dataset or one of my own
list<-taxonlist_per_condition(urt,condition="location")
Error in any_samples_left(ta) : No samples left after filtering

Furthermore, I think it would be useful to immediately add the taxonomy of the taxon_ids to the tables. Finally, optionally, a read cut-off option would be nice, as I suspect some cross-contamination has occurred in my dataset and to exclude false positive presence of taxa I would set the read cut-off a bit higher at 3. Although the latter can also be achieved by filtering the counts table before executing the taxonlist_per_condition function.

[BUG] tacoplot_ord x argument requests string, but does the wrong thing with a string

I'd suggest changing the docs instead of the code here, and to give an example of use.

library(tidytacos)

urt %>% tacoplot_ord(x = "participant")
#> Warning in tacoplot_ord(., x = "participant"): Empty samples detected, removing
#> them from the analysis

Created on 2024-07-26 with reprex v2.1.1

tacoplot_stack pie = TRUE suggestions

To avoid misuse I feel the pie = TRUE should error if more than one sample is supplied. Or at least have an example of reasonable use.

library(tidytacos)
urt %>%
  filter_samples(sample_id == "s169") %>%
  tacoplot_stack(pie = TRUE)

urt %>%
  filter_samples(participant == "CON83") %>%
  tacoplot_stack(pie = TRUE)

urt %>% 
  filter_samples(location == "N", method == "S") %>%
  tacoplot_stack(pie = TRUE)

Created on 2024-07-25 with reprex v2.1.1

[BUG] rank_names function unexpected behaviour

rank_names should "Return rank names associated with a tidytacos object"

But really it just always returns c("kingdom", "phylum", "class", "order", "family", "genus")

Which is quite unexpected.

library(tidytacos)

x <- matrix(c(1500, 1300, 280, 356), ncol = 2)
rownames(x) <- c("taxon1", "taxon2")
colnames(x) <- c("sample1", "sample2")
my_taco <- create_tidytacos(x, taxa_are_columns = FALSE)
my_taco$taxa
#> # A tibble: 2 × 2
#>   taxon  taxon_id
#>   <chr>  <chr>   
#> 1 taxon1 t1      
#> 2 taxon2 t2

rank_names(my_taco)
#> [1] "kingdom" "phylum"  "class"   "order"   "family"  "genus"

rank_names is called indirectly by tacoplot_stack, which then causes this kind of problem

my_taco %>% tacoplot_stack()
#> Error in `mutate()`:
#> ℹ In argument: `best_classification = purrr::pmap_chr(...)`.
#> Caused by error in `ta$taxa[, rank_names]`:
#> ! Can't subset columns that don't exist.
#> ✖ Columns `kingdom`, `phylum`, `class`, `order`, `family`, etc. don't exist.

Created on 2024-07-26 with reprex v2.1.1
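One way to get the documented behaviour is to report only the ranks that actually exist as columns of the taxon table. The sketch below is an assumption about what the fix might look like, not the package's code. ranks_present() and the toy tables are hypothetical.

```r
# The default ranks reported in the issue above.
standard_ranks <- c("kingdom", "phylum", "class", "order", "family", "genus")

# Hypothetical replacement logic: keep only candidate ranks present as columns.
ranks_present <- function(taxa, candidates = standard_ranks) {
  candidates[candidates %in% names(taxa)]
}

taxa_no_ranks <- data.frame(taxon = c("taxon1", "taxon2"),
                            taxon_id = c("t1", "t2"))
taxa_some_ranks <- data.frame(taxon_id = "t1",
                              genus = "Lactobacillus",
                              family = "Lactobacillaceae")

ranks_present(taxa_no_ranks)    # character(0)
ranks_present(taxa_some_ranks)  # "family" "genus"
```

A caller such as a plotting function could then stop with a clear message when the result is empty, rather than failing on a column subset.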

at least one of the taxonomic rank names should be present in the taxon table [BUG]

When I run:

max_taxa <- 144
used_rank <- "class"
tidy_physeq %>%
  remove_empty_samples() %>%
  tidytacos::set_rank_names(
    rank_names = phyloseq::rank_names(physeq_18SP_no_singletons)
  ) %>%
  aggregate_taxa(rank = used_rank) %>%
  tidytacos::add_prevalence() %>%
  tidytacos::mutate_taxa(
    keep = min_rank(desc(occurrence)) < max_taxa
  ) %>%
  filter_taxa(
    keep,
    !is.na(class)
  ) %>%
  tidytacos::everything() %>%
  mutate(count = as.integer(count)) %>%
  select(taxon_id, sample_id, count, sample, Cmon_PlotID, Diepte,
         Landgebruik_MBAG, class, occurrence) %>%
  filter(
    complete.cases(.)
  )

I get:

Error in `aggregate_taxa()`:
! at least one of the taxonomic rank names should be present in the taxon table
Backtrace:
  1. ... %>% filter(complete.cases(.))
 20. tidytacos::aggregate_taxa(., rank = used_rank)
Execution halted

I think this might be related to the fact that we used the PR2 database for taxonomic assignment, which has an unusual taxonomic structure with 9 levels:

Domain / Supergroup / Division / Subdivision / Class / Order / Family / Genus / Species

Because when I use the above chunk of code for other primersets that were classified using a database with traditional taxonomic structure:

Phylum / Class / Order / Family / Genus / Species

I don't have this problem

In the help file of tidytacos::aggregate_taxa I read:

If the rank you are interested in is in the standard list, just supply it as an argument. If not, delete all taxon variables except taxon_id and the ranks you are still interested in prior to calling this function.

But I'm not sure what you mean with the standard list?
