tidytacos's Introduction

tidytacos

Overview

Tidytacos (tidy TAxonomic COmpositionS) is an R package for the exploration of microbial community data. Such community data consists of read counts generated by amplicon sequencing (e.g. a region of the 16S rRNA gene) or metagenome (shotgun) sequencing. Each read count represents a number of sequencing reads identified for some taxon (an ASV, OTU, species, or higher-level taxon) in a sample.

Tidytacos builds on the tidyverse created by Hadley Wickham: the data are stored in tidy tables where each row is an observation and each column a variable. In addition, the package supplies a set of "verbs": functions that take a tidytacos object as first argument and also return a tidytacos object. This makes it easy to construct "pipe chains" of code that represent series of operations performed on the tidytacos object.
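
As a rough illustration of the tidy layout described above, the following base R sketch reshapes a taxon-by-sample count matrix into a long table with one row per observation. It does not use tidytacos itself; the matrix and the column names (taxon, sample, count) are made up for the example and are not necessarily the names tidytacos uses internally.

```r
# Hypothetical toy count matrix: rows are taxa, columns are samples.
x <- matrix(
  c(1500, 1300, 280, 356),
  ncol = 2,
  dimnames = list(
    taxon = c("taxon1", "taxon2"),
    sample = c("sample1", "sample2")
  )
)

# as.table() followed by as.data.frame() yields one row per
# taxon/sample pair, with the read count in its own column.
counts_long <- as.data.frame(
  as.table(x),
  responseName = "count",
  stringsAsFactors = FALSE
)

print(counts_long)
```

Each row of counts_long is one observation (a read count for one taxon in one sample) and each column is one variable, which is the shape that pipe-friendly verbs operate on.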

Prerequisites

Tidytacos is an R package. You can find instructions to download and install R here.

Tidytacos relies on the tidyverse R package (or, more accurately, set of R packages). You can install the tidyverse by running the following R code:

install.packages("tidyverse")

Finally, RStudio is a convenient IDE for working with R code (as well as code in other scripting languages). It offers more than the default R interface: beyond creating and saving scripts, it displays your figures and lets you navigate files, inspect tables, and more. You can download RStudio here.

Installation

Run the following R code to install the latest version of tidytacos:

install.packages("devtools")
devtools::install_github("LebeerLab/tidytacos")

Documentation

A documentation page (help page) is available for all functions in the browser or in R. You can view it in R by running e.g. ?filter_samples. Some useful tutorials can be found on the wiki.

tidytacos's People

Contributors

swittouck, theoafidian, wsmets


Forkers

justicengom

tidytacos's Issues

tacoplot_alphas: samples with missing alpha diversity values removed only at plotting stage

tacoplot_ord notifies the user and drops empty samples up front, maybe tacoplot_alphas could do something similar?

When I applied tacoplot_alphas directly on a tidytacos object it was quite mysterious to have samples dropped from the plot.

Or maybe a warning in add_alpha is also a good idea, saying something like: N samples were empty, returning NA for those samples.

library(tidytacos)
urt_s <- urt %>% 
  filter_samples(method == "S") %>% 
  add_alphas(methods = "shannon") 


urt_s %>% tacoplot_alphas(group_by = location)
#> Warning: Removed 2 rows containing non-finite outside the scale range
#> (`stat_ydensity()`).
#> Warning: Removed 2 rows containing missing values or values outside the scale range
#> (`geom_point()`).

sum(is.na(urt_s$samples$shannon))
#> [1] 2

Created on 2024-07-26 with reprex v2.1.1
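
The warning suggested in this issue could look roughly like the sketch below. It is a hypothetical helper written in base R, not code from tidytacos: the function name warn_missing_alpha and the toy sample table are assumptions made for illustration, with the table standing in for urt_s$samples.

```r
# Hypothetical helper, not part of tidytacos: count the samples whose
# alpha diversity value is missing and warn before any plotting happens.
warn_missing_alpha <- function(samples, metric) {
  n_missing <- sum(is.na(samples[[metric]]))
  if (n_missing > 0) {
    warning(
      n_missing, " samples have no value for '", metric,
      "' and will be dropped from the plot.",
      call. = FALSE
    )
  }
  invisible(n_missing)
}

# Toy sample table standing in for urt_s$samples (values made up).
samples <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  shannon = c(1.2, NA, NA)
)

# The helper returns the number of affected samples invisibly.
n_missing <- suppressWarnings(warn_missing_alpha(samples, "shannon"))
```

Raising the warning at this point tells the user which metric is affected and how many samples are involved, instead of leaving it to the plotting layer.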

docs need instructions for users to import their own raw data

The quick start guide directs users to the source code to learn about more options for importing data.

More options to import and convert your data can be found here.

I strongly discourage this. Novice users should not be expected to read the source code to learn how to use the package. This information should be distilled into a vignette for users to read.

(Originally posted by @kelly-sovacool in #50 (comment))

[BUG] minor warnings during R CMD build

Describe the bug

R CMD build emits warnings during installation:

── R CMD build ────────────────────────────────────────────────────────────────────────────────────────────
✔  checking for file ‘/private/var/folders/yb/f86k3qcj10nfyx57x30_dgnnsds0t8/T/RtmpTemN5J/remotes178f71dc179bf/LebeerLab-tidytacos-61b451d/DESCRIPTION’ ...
─  preparing ‘tidytacos’:
─  checking DESCRIPTION meta-information ...Warning in person1(given = given[[i]], family = family[[i]], middle = middle[[i]],  :
     It is recommended to use ‘given’ instead of ‘middle’.
    OK
   Warning in person1(given = given[[i]], family = family[[i]], middle = middle[[i]],  :
     It is recommended to use ‘given’ instead of ‘middle’.
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘tidytacos_1.0.0.tar.gz’
   Warning: invalid uid value replaced by that for user 'nobody'
   Warning: invalid gid value replaced by that for user 'nobody'

Installation still proceeds successfully, so this is not a blocking problem.

To Reproduce

in a fresh environment:

devtools::install_github("LebeerLab/tidytacos")

or with a local clone of the repo:

devtools::install("path/to/tidytacos")

Expected behavior

No warnings

Screenshots

Version information (please complete the following information):

  • OS: macOS 14.5
  • R version: 4.3.1
  • tidytacos version: 1.0.0 (cloned from master branch)

Additional context

[BUG] read_tidytacos error handling broken

Hi tidytacos team, I'm testing out the package as part of the JOSS review and will raise issues for any bugs I spot on the way, such as this little one below.

cheers
David

A typo in the read_tidytacos code means a cryptic error appears if the count table is not found, instead of the message you intended.

library(tidytacos)
urt %>% write_tidytacos("temp_dir")
unlink("temp_dir/counts.csv")
read_tidytacos("temp_dir")
#> Error in paste("File", counts, ", containing count data not found in", : cannot coerce type 'closure' to vector of type 'character'

Created on 2024-07-25 with reprex v2.1.1
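
The message in the reprex above is what base R produces when a function, rather than a character string, is handed to paste(). The sketch below first reproduces that message with an unrelated function (mean), then shows one way a file check could be written so the intended message appears. The helper check_counts_file is hypothetical and is not the actual read_tidytacos source; the file name counts.csv is taken from the reprex.

```r
# Passing a function (a "closure") to paste() triggers the cryptic error.
closure_msg <- tryCatch(
  paste("File", mean),
  error = function(e) conditionMessage(e)
)

# Hypothetical check, not the actual read_tidytacos() code: build the
# path as a character string first, then reuse that string in the message.
check_counts_file <- function(din) {
  counts_path <- file.path(din, "counts.csv")
  if (!file.exists(counts_path)) {
    stop(
      "File ", counts_path, ", containing count data, not found.",
      call. = FALSE
    )
  }
  invisible(counts_path)
}

# An empty temporary directory stands in for a folder missing counts.csv.
empty_dir <- tempfile("taco_")
dir.create(empty_dir)
msg <- tryCatch(
  check_counts_file(empty_dir),
  error = function(e) conditionMessage(e)
)
```

Keeping the path in a variable with a distinct name (counts_path here) avoids accidentally picking up a function that happens to share the name.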

Tacoplot_stack shows only one sample

When I use tacoplot_stack(ta), it shows the bar of only one sample called "NA", possibly all samples together?
tacoplot_stack(ta, x=sample) gives the following error:

Error in `geom_bar()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error in `lvls_reorder()`:
! `idx` must contain one integer for each level of `f`
Backtrace:
  1. base (local) <fn>(x)
  2. forcats::fct_reorder(sample_name, as.integer(sample_clustered))
  3. forcats::lvls_reorder(f, order(summary, decreasing = .desc))
Error in geom_bar(stat = "identity") :
ℹ Error occurred in the 1st layer.
Caused by error in `lvls_reorder()`:
! `idx` must contain one integer for each level of `f`

The plotly version also doesn't work.

missing unit tests for some functions

I recommend having at least some unit tests for every function, and especially so for user-facing functions. It is not necessary to have 100% code coverage, but having minimal tests for user-facing functions helps safeguard against accidentally introducing API-breaking changes.

From looking at the codecov reports, it appears several functions lack any unit tests. These include:

  • from_dada
  • from_phyloseq
  • create_biom_header
  • to_biom
  • add_jervis_bardy
  • get_ord_stat
  • tacoplot_zoom
  • tacoplot_venn
  • tacoplot_venn_ly
  • tacoplot_euler
  • tacoplot_prevalences

[BUG] create_tidytacos example problems

create_tidytacos help page example doesn't create a usable tidytacos object, and doesn't give any guidance on how to continue.

library(tidytacos)

# example taken from create_tidytacos help page
x <- matrix(
  c(1500, 1300, 280, 356),
  ncol = 2
)
rownames(x) <- c("taxon1", "taxon2")
colnames(x) <- c("sample1", "sample2")

# Convert to tidytacos object
data <- create_tidytacos(x, taxa_are_columns = FALSE)

# tacoplot_stack has unstated requirements about the taxonomy table format?
data %>% tacoplot_stack()
#> Error in `mutate()`:
#> ℹ In argument: `best_classification = purrr::pmap_chr(...)`.
#> Caused by error in `ta$taxa[, rank_names]`:
#> ! Can't subset columns that don't exist.
#> ✖ Columns `kingdom`, `phylum`, `class`, `order`, `family`, etc. don't exist.

# not sure what is happening here, maybe tacoplot_ord doesn't like having only 2 samples?
data %>% tacoplot_ord(x = "sample_id")
#> Error in stats::cmdscale(dist_matrix, k = dims, eig = T, list = T, ...): 'k' must be in {1, 2, ..  n - 1}

# lastly, not sure what this bit is
tidytacos("a")
#> Error in tidytacos("a"): could not find function "tidytacos"

Created on 2024-07-26 with reprex v2.1.1
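
On the tacoplot_ord question in the reprex above: the error comes from stats::cmdscale, which only accepts a number of dimensions k between 1 and n - 1, where n is the number of points. With two samples, only k = 1 is valid. The base R sketch below reproduces this outside tidytacos; that tacoplot_ord asks for two dimensions by default is an assumption based on the error message, not something confirmed from the source code.

```r
# Same toy matrix as in the reprex: 2 taxa (rows) x 2 samples (columns).
x <- matrix(c(1500, 1300, 280, 356), ncol = 2)

# Euclidean distances between the 2 samples (columns of x).
d <- dist(t(x))

# Requesting k = 2 dimensions fails when there are only n = 2 points.
err <- tryCatch(
  cmdscale(d, k = 2),
  error = function(e) conditionMessage(e)
)

# k = 1 is the only valid choice for 2 points.
pts <- cmdscale(d, k = 1)
```

So the help page example would need at least three samples before a two-dimensional ordination can be drawn.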

recommend adding example data files for documentation

The quickstart explains how to load files into tidytacos:

taco <- read_tidytacos("/path/to/my_data")

but then proceeds to use the urt dataset for the rest of the tutorial, which is already a tidytacos object.

It would be helpful to have an example file in your package in inst/extdata and use it in the tutorial, so users will have a better understanding of the expected file format.

Additionally, it would be best practice to have code in data-raw to show how urt and leaf were created. https://r-pkgs.org/data.html#sec-data-data-raw

[BUG] R CMD Check failing on local clone of master branch

Describe the bug

R CMD Check is failing on a local clone of the master branch.

══ Documenting ════════════════════════════════════════════════════════════════════════════════════════════
ℹ Updating tidytacos documentation
ℹ Loading tidytacos
Error in loadNamespace(x) : there is no package called ‘SpiecEasi’

To Reproduce

Clone the repo, run devtools::install() followed by devtools::check() in a fresh environment.

Expected behavior

Check completing with no errors, warnings, or notes.

Screenshots

Version information (please complete the following information):

  • OS: [e.g. iOS, ubuntu, windows] Sys.info()["sysname"]
  • R version [e.g. 4.1.2] R.version
  • tidytacos version [e.g. 0.2.2] packageVersion('tidytacos')

Additional context

SpiecEasi is not in your DESCRIPTION file. I notice your GitHub Actions workflow installs SpiecEasi along with other packages manually. Any of these packages that are actually used by the package should be added to the DESCRIPTION as Imports or Suggests. I recommend removing them from the extra-packages part of your setup so you'll have a more accurate check workflow.

In case you need it, here's a guide on specifying how to install packages from github, bioconductor, etc in your DESCRIPTION file: https://cran.r-project.org/web/packages/devtools/vignettes/dependencies.html
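
If a package such as SpiecEasi ends up under Suggests rather than Imports, the functions that need it usually guard the call with requireNamespace(). The sketch below shows that common pattern; the helper name require_suggested and the wording of the message are assumptions for illustration, not tidytacos code.

```r
# Hypothetical guard for an optional (Suggests) dependency.
require_suggested <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(
      "Package '", pkg, "' is needed for this function. ",
      "Please install it first.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# A package name that does not exist triggers the informative error.
msg <- tryCatch(
  require_suggested("aPackageThatDoesNotExist"),
  error = function(e) conditionMessage(e)
)

# A package that ships with R passes the check.
ok <- require_suggested("stats")
```

With a guard like this, loading the package no longer fails when the optional dependency is absent; only the function that needs it does, and with a clear message.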

[BUG] tacoplot_stack x argument doesn't accept string

tacoplot_stack x argument requests a string in the help doc, but won't accept one

library(tidytacos)
urt %>% tacoplot_stack(x = "participant")
#> Warning in tacoplot_stack(., x = "participant"): Sample labels not unique,
#> samples are aggregated.
#> Error in `geom_bar()`:
#> ! Problem while computing aesthetics.
#> ℹ Error occurred in the 1st layer.
#> Caused by error in `forcats::fct_reorder()`:
#> ! length(f) == length(.x) is not TRUE

# As a side note: "Label not in sample table" seems like it should be an immediate error rather than a warning
urt %>% tacoplot_stack(x = participants)
#> Warning in tacoplot_stack(., x = participants): Label 'participants' not found
#> in the samples table.
#> Error in `pull()`:
#> Caused by error:
#> ! object 'participants' not found

Created on 2024-07-25 with reprex v2.1.1
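
The side note in the reprex above (an immediate error rather than a warning) could be implemented along these lines. This is a hypothetical base R check, not tidytacos code; check_label and the toy sample table are assumptions made for the example.

```r
# Hypothetical validation, not tidytacos code: fail fast when the
# requested label is not a column of the sample table.
check_label <- function(samples, label) {
  if (!label %in% names(samples)) {
    stop(
      "Label '", label, "' not found in the samples table.",
      call. = FALSE
    )
  }
  invisible(label)
}

# Toy sample table (values made up).
samples <- data.frame(
  sample_id = c("s1", "s2"),
  participant = c("p1", "p1")
)

# An existing column passes; a misspelled one stops right away.
ok <- check_label(samples, "participant")
msg <- tryCatch(
  check_label(samples, "participants"),
  error = function(e) conditionMessage(e)
)
```

Stopping here keeps the user from seeing the less helpful downstream error from pull().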

several function arguments not documented - and rcmdcheck warnings ignored

add_metadata has a metadata_tibble argument but this is documented as metadata

This should give you a warning when you run rcmdcheck, so I'm not sure how your CI is passing...

Edit: I checked your GitHub Actions setup and see you've configured it to only fail on errors. You have a lot of undocumented argument warnings: https://github.com/LebeerLab/tidytacos/actions/runs/10097946221/job/27923934246#step:6:358

I can understand not wanting to fail on notes, as your use of tidy evaluation gives a lot of rogue notes, but ignoring these warnings isn't a good idea, e.g. here you miss documentation problems.

[BUG] tacoplot_ord silently ignores colouring by sample or sample_id

If this is intended behaviour I would expect it to warn me it will not colour each separate sample.

library(tidytacos)

urt %>% tacoplot_ord(x = sample)
#> Warning in tacoplot_ord(., x = sample): Empty samples detected, removing them
#> from the analysis

urt %>% tacoplot_ord(x = sample_id)
#> Warning in tacoplot_ord(., x = sample_id): Empty samples detected, removing
#> them from the analysis

Created on 2024-07-26 with reprex v2.1.1

docs style: referring to function names

As a general rule, I recommend referring to specific functions as if they were proper names, followed by parentheses and enclosed in backticks. This helps make documentation more concise and easy to read.

For example, this sentence:

The add_total_count function will add total sample read numbers to the sample table.

Could be revised as:

add_total_count() will add total sample read numbers to the sample table.

As an added bonus, pkgdown will be able to turn that function reference into a hyperlink to its doc.

This is a soft recommendation.

[BUG] Sample clustering fails when "sample_name" column exists

When a column called "sample_name" is present in the sample table, plotting a stack plot fails (produces a single stack called "NA" instead of one per sample).

Code that produces the error (tidytacos v1.0.0, R v4.3.2):

library(tidyverse)
library(tidytacos)

# read data
data <- read_rds("tacoplot_bug.rds")

# this fails (only one stack called "NA")
data %>% tacoplot_stack() 

# this fixes the problem! 
data$samples$sample_name <- NULL
data %>% tacoplot_stack() 

I'm pretty sure it has to do with one of the counts_matrix functions (or a similar one) using the "sample_name" column as rownames for the count matrix instead of the "sample_id" column.

The dataset: tacoplot_bug.zip

[BUG] version in DESCRIPTION is out of sync

Two problems:

  • The version in the DESCRIPTION file from the tagged release v1.0.1 states the version is 1.0.0. Looks like you forgot to bump the version before cutting the release.
  • Currently, the DESCRIPTION in the master branch states the version is 1.0.0. Since it is ahead of the most recent release, it should ideally be of the form 1.0.1.9XXX to mark it as a development version. I recommend doing this immediately after cutting a release. See R pkgs guide on post-release
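
The ordering that makes the 1.0.1.9XXX convention work can be checked with base R's package_version(). The version strings below are taken from this issue, with 9000 chosen as an example development suffix.

```r
release <- package_version("1.0.1")       # tagged release
dev     <- package_version("1.0.1.9000")  # development version after the release
stale   <- package_version("1.0.0")       # what DESCRIPTION currently states

# A development version sorts after the release it follows.
dev_is_newer <- dev > release

# The stale version sorts before the tagged release.
stale_is_older <- stale < release
```

Both comparisons evaluate to TRUE, so a package installed from master with a .9000 suffix is recognisably newer than the tagged release.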

Visualizing shared ASVs between conditions

I think a way to determine which ASVs are shared for certain conditions would be a nice addition to the package. I would find a table with overlaps between groups useful (proportion of ASVs that are shared) similar to the betas table. The ggVennDiagram package might also be useful for visualization (e.g. tacoplot_venn(ta, condition=location)), where the input needs to be a list of ASVs per condition.

Additionally, it would be nice to add an additional variable for subsetting samples. For example for the URT dataset: the proportion of shared ASVs between nose and nasopharynx per participant, then the additional argument for the function would be: "shared_within=participant" or something.
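
A minimal base R sketch of the requested overlap calculation, assuming a long count table with one row per taxon and condition. All values are made up, and the code is not tied to any existing tidytacos function.

```r
# Toy long-format counts with a condition column (all values made up).
counts <- data.frame(
  taxon_id = c("t1", "t2", "t3", "t1", "t3", "t4"),
  location = c("N", "N", "N", "NF", "NF", "NF"),
  count = c(10, 5, 2, 7, 1, 3)
)

# One vector of taxon IDs per condition: the kind of list input
# a Venn diagram function would need.
present <- counts[counts$count > 0, ]
taxa_per_condition <- lapply(
  split(present$taxon_id, present$location),
  unique
)

# Taxa found in every condition, and their share of all taxa seen.
shared <- Reduce(intersect, taxa_per_condition)
all_taxa <- Reduce(union, taxa_per_condition)
prop_shared <- length(shared) / length(all_taxa)
```

The "shared_within" idea could be approximated by running the same steps separately on each participant's subset of the table.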

Add extra alpha div metrics

Is your feature request related to a problem? Please describe.
The existing add_alphas adds observed and inverse_simpson alpha diversity metrics. It would be nice to allow other types too.

Describe the solution you'd like
An optional metric argument(?) to the add_alphas function, to allow picking of other metrics.
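
One way such a metric argument might behave is sketched below in base R for a single sample's count vector. The helper alpha_diversity is hypothetical and is not the tidytacos implementation; the formulas are the standard Shannon index (natural log) and inverse Simpson index.

```r
# Hypothetical helper, not the tidytacos implementation: compute one
# alpha diversity metric from the read counts of a single sample.
alpha_diversity <- function(counts,
                            metric = c("observed", "shannon", "inverse_simpson")) {
  metric <- match.arg(metric)
  counts <- counts[counts > 0]
  # An empty sample has no defined diversity.
  if (length(counts) == 0) {
    return(NA_real_)
  }
  p <- counts / sum(counts)
  switch(metric,
    observed = length(counts),
    shannon = -sum(p * log(p)),
    inverse_simpson = 1 / sum(p^2)
  )
}

# Two equally abundant taxa plus one absent taxon.
obs <- alpha_diversity(c(10, 10, 0), "observed")
shn <- alpha_diversity(c(10, 10, 0), "shannon")
inv <- alpha_diversity(c(10, 10, 0), "inverse_simpson")
```

Using match.arg() means an unsupported metric name fails with a clear message listing the allowed choices.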

expanding on taxonlist_per_condition

Hi,

I like the concept of the taxonlist_per_condition function, but it doesn't seem to work for either the URT dataset or one of my own:
list<-taxonlist_per_condition(urt,condition="location")
Error in any_samples_left(ta) : No samples left after filtering

Furthermore, I think it would be useful to immediately add the taxonomy of the taxon_ids to the tables. Finally, optionally, a read cut-off option would be nice, as I suspect some cross-contamination has occurred in my dataset and to exclude false positive presence of taxa I would set the read cut-off a bit higher at 3. Although the latter can also be achieved by filtering the counts table before executing the taxonlist_per_condition function.

tacoplot_stack pie = TRUE suggestions

To avoid misuse I feel the pie = TRUE should error if more than one sample is supplied. Or at least have an example of reasonable use.

library(tidytacos)
urt %>%
  filter_samples(sample_id == "s169") %>%
  tacoplot_stack(pie = TRUE)

urt %>%
  filter_samples(participant == "CON83") %>%
  tacoplot_stack(pie = TRUE)

urt %>% 
  filter_samples(location == "N", method == "S") %>%
  tacoplot_stack(pie = TRUE)

Created on 2024-07-25 with reprex v2.1.1

[BUG] rank_names function unexpected behaviour

rank_names should "Return rank names associated with a tidytacos object"

But really it just always returns c("kingdom", "phylum", "class", "order", "family", "genus")

Which is quite unexpected.

library(tidytacos)

x <- matrix(c(1500, 1300, 280, 356), ncol = 2)
rownames(x) <- c("taxon1", "taxon2")
colnames(x) <- c("sample1", "sample2")
my_taco <- create_tidytacos(x, taxa_are_columns = FALSE)
my_taco$taxa
#> # A tibble: 2 × 2
#>   taxon  taxon_id
#>   <chr>  <chr>   
#> 1 taxon1 t1      
#> 2 taxon2 t2

rank_names(my_taco)
#> [1] "kingdom" "phylum"  "class"   "order"   "family"  "genus"

rank_names is called indirectly by tacoplot_stack, which then causes this kind of problem

my_taco %>% tacoplot_stack()
#> Error in `mutate()`:
#> ℹ In argument: `best_classification = purrr::pmap_chr(...)`.
#> Caused by error in `ta$taxa[, rank_names]`:
#> ! Can't subset columns that don't exist.
#> ✖ Columns `kingdom`, `phylum`, `class`, `order`, `family`, etc. don't exist.

Created on 2024-07-26 with reprex v2.1.1
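
One possible behaviour that would avoid the downstream error is to return only the default ranks that actually exist as columns in the taxa table. The sketch below is a hypothetical base R helper, not the tidytacos implementation; the default rank vector is copied from this issue, and the second toy table is made up.

```r
# Default ranks as reported in this issue.
default_ranks <- c("kingdom", "phylum", "class", "order", "family", "genus")

# Hypothetical helper: keep only the ranks present as columns.
present_rank_names <- function(taxa, ranks = default_ranks) {
  intersect(ranks, names(taxa))
}

# Taxa table without any rank columns, as in the reprex above.
taxa_no_ranks <- data.frame(
  taxon = c("taxon1", "taxon2"),
  taxon_id = c("t1", "t2")
)

# Taxa table with only some of the default ranks (values made up).
taxa_some_ranks <- data.frame(
  taxon_id = "t1",
  phylum = "Firmicutes",
  genus = "Lactobacillus"
)

none_found <- present_rank_names(taxa_no_ranks)
some_found <- present_rank_names(taxa_some_ranks)
```

A caller such as a plotting function could then check whether the result is empty and fall back to taxon_id, or stop with a clear message.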

at least one of the taxonomic rank names should be present in the taxon table [BUG]

When I run:

max_taxa <- 144
used_rank <- "class"
tidy_physeq %>%
  remove_empty_samples() %>%
  tidytacos::set_rank_names(
    rank_names = phyloseq::rank_names(physeq_18SP_no_singletons)
  ) %>%
  aggregate_taxa(rank = used_rank) %>%
  tidytacos::add_prevalence() %>%
  tidytacos::mutate_taxa(
    keep = min_rank(desc(occurrence)) < max_taxa
  ) %>%
  filter_taxa(
    keep,
    !is.na(class)
  ) %>%
  tidytacos::everything() %>%
  mutate(count = as.integer(count)) %>%
  select(taxon_id, sample_id, count, sample, Cmon_PlotID, Diepte,
         Landgebruik_MBAG, class, occurrence) %>%
  filter(
    complete.cases(.)
  )

I get:

Error in `aggregate_taxa()`:
! at least one of the taxonomic rank names should be present in the taxon table
Backtrace:
  1. ... %>% filter(complete.cases(.))
 20. tidytacos::aggregate_taxa(., rank = used_rank)
Execution halted

I think this might be related to the fact that we used the PR2 database for taxonomic assignment, which has an unusual taxonomic structure with 9 levels:

Domain / Supergroup / Division / Subdivision / Class / Order / Family / Genus / Species

Because when I use the above chunk of code for other primersets that were classified using a database with traditional taxonomic structure:

Phylum / Class / Order / Family / Genus / Species

I don't have this problem

In the help file of tidytacos::aggregate_taxa I read:

  • If the rank you are interested in is in the standard list, just supply it as an argument.
  • If not, delete all taxon variables except taxon_id and the ranks you are still interested in prior to calling this function

But I'm not sure what you mean by the standard list?
