Comments (9)
Thanks! Which workshop are you following?
Some comments
counts_tt <-
airway %>%
tidybulk() %>%
You don't need to transform to tidybulk from SummarizedExperiment necessarily.
counts_scaled <- counts_tt %>% scale_abundance(factor_of_interest = dex)
You need to identify (or keep) abundant gene transcripts, before scaling
counts_tt %>% identify_abundant(factor_of_interest = ...) %>% scale_abundance(...)
This will execute
filterByExpr groups
Let us know how you go. Do the TMM match now?
from tidybulk.
Hello Stefano!
I started from this one: https://stemangiola.github.io/bioc_2020_tidytranscriptomics/articles/tidytranscriptomics.html
My goal is to see if I can use all the functions of edgeR and limma through tidybulk in order to reform
a workflow for smallRNAs.
I didn't use the identify_abundant
because I get:
counts_tt %>% identify_abundant(factor_of_interest = "dex")
Error in identify_abundant(., factor_of_interest = "dex") :
could not find function "identify_abundant"
My session info:
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.0.3 airway_1.8.0 SummarizedExperiment_1.18.2
[4] DelayedArray_0.14.1 matrixStats_0.58.0 Biobase_2.48.0
[7] GenomicRanges_1.40.0 GenomeInfoDb_1.24.2 IRanges_2.22.2
[10] S4Vectors_0.26.1 BiocGenerics_0.34.0 tidybulk_1.0.2
loaded via a namespace (and not attached):
[1] pillar_1.4.7 compiler_4.0.3 RColorBrewer_1.1-2
[4] BiocManager_1.30.10 XVector_0.28.0 zlibbioc_1.34.0
[7] bitops_1.0-6 tools_4.0.3 digest_0.6.27
[10] lattice_0.20-41 evaluate_0.14 lifecycle_0.2.0
[13] tibble_3.0.5 preprocessCore_1.50.0 pkgconfig_2.0.3
[16] rlang_0.4.10 Matrix_1.2-18 rstudioapi_0.13
[19] cli_2.3.0 DBI_1.1.1 yaml_2.2.1
[22] xfun_0.20 GenomeInfoDbData_1.2.3 knitr_1.31
[25] generics_0.1.0 vctrs_0.3.6 hms_1.0.0
[28] grid_4.0.3 tidyselect_1.1.0 glue_1.4.2
[31] R6_2.5.0 fansi_0.4.2 rmarkdown_2.6
[34] purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
[37] magrittr_2.0.1 ellipsis_0.3.1 htmltools_0.5.1.1
[40] assertthat_0.2.1 utf8_1.1.4 RCurl_1.98-1.2
[43] crayon_1.4.0
from tidybulk.
Hello @ConYel,
I think you are using an old version.
I suggest to
- use R 4.1.x,
- install the last version of tidybulk (either from Bioconductor or from github master)
- look at most recent workshop
- https://github.com/tidytranscriptomics-workshops
Let us know how you go.
from tidybulk.
Hello @stemangiola
I updated everything as you suggested but still it doesn't give the same results.
my new session:
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] tibble_3.1.5 tidyr_1.1.4 airway_1.14.0
[4] tidySummarizedExperiment_1.4.1 SummarizedExperiment_1.24.0 Biobase_2.54.0
[7] GenomicRanges_1.46.0 GenomeInfoDb_1.30.0 IRanges_2.28.0
[10] S4Vectors_0.32.0 BiocGenerics_0.40.0 MatrixGenerics_1.6.0
[13] matrixStats_0.61.0 edgeR_3.36.0 limma_3.50.0
[16] stringr_1.4.0 tidybulk_1.6.1 dplyr_1.0.7
loaded via a namespace (and not attached):
[1] httr_1.4.2 viridisLite_0.4.0 jsonlite_1.7.2 bit64_4.0.5
[5] assertthat_0.2.1 BiocManager_1.30.16 blob_1.2.2 GenomeInfoDbData_1.2.7
[9] yaml_2.2.1 pillar_1.6.4 RSQLite_2.2.8 lattice_0.20-45
[13] glue_1.4.2 digest_0.6.28 RColorBrewer_1.1-2 XVector_0.34.0
[17] colorspace_2.0-2 htmltools_0.5.2 preprocessCore_1.56.0 Matrix_1.3-4
[21] pkgconfig_2.0.3 zlibbioc_1.40.0 purrr_0.3.4 scales_1.1.1
[25] tzdb_0.2.0 KEGGREST_1.34.0 generics_0.1.1 ggplot2_3.3.5
[29] ellipsis_0.3.2 cachem_1.0.6 lazyeval_0.2.2 cli_3.1.0
[33] magrittr_2.0.1 crayon_1.4.2 memoise_2.0.0 evaluate_0.14
[37] fansi_0.5.0 data.table_1.14.2 tools_4.1.2 org.Hs.eg.db_3.14.0
[41] hms_1.1.1 lifecycle_1.0.1 plotly_4.10.0 munsell_0.5.0
[45] locfit_1.5-9.4 DelayedArray_0.20.0 AnnotationDbi_1.56.1 Biostrings_2.62.0
[49] compiler_4.1.2 rlang_0.4.12 grid_4.1.2 RCurl_1.98-1.5
[53] rstudioapi_0.13 htmlwidgets_1.5.4 bitops_1.0-7 rmarkdown_2.11
[57] gtable_0.3.0 DBI_1.1.1 R6_2.5.1 knitr_1.36
[61] fastmap_1.1.0 bit_4.0.4 utf8_1.2.2 readr_2.0.2
[65] stringi_1.7.5 parallel_4.1.2 Rcpp_1.0.7 vctrs_0.3.8
[69] png_0.1-7 tidyselect_1.1.1 xfun_0.27
Results
counts_scaled %>% group_by(sample, TMM, multiplier ) %>% summarise()
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
`summarise()` has grouped output by 'sample', 'TMM'. You can override using the `.groups` argument.
# A tibble: 8 x 3
# Groups: sample, TMM [8]
sample TMM multiplier
<chr> <dbl> <dbl>
1 SRR1039508 1.06 1.42
2 SRR1039509 1.02 1.60
3 SRR1039512 0.987 1.23
4 SRR1039513 0.951 2.14
5 SRR1039516 1.03 1.22
6 SRR1039517 0.975 1.03
7 SRR1039520 1.02 1.58
8 SRR1039521 0.959 1.52
Warning message:
In is_sample_feature_deprecated_used(.data, (enquos(..., .ignore_empty = "all") %>% :
tidySummarizedExperiment says: from version 1.3.1, the special columns including sample/feature id (colnames(se), rownames(se)) has changed to ".sample" and ".feature". This dataset is returned with the old-style vocabulary (feature and sample), however we suggest to update your workflow to reflect the new vocabulary (.feature, .sample)
dgList_norm$samples %>% arrange(norm.factors) %>% select(norm.factors)
norm.factors
SRR1039513 0.9484696
SRR1039521 0.9538599
SRR1039517 0.9779832
SRR1039512 0.9904567
SRR1039509 1.0212137
SRR1039520 1.0269341
SRR1039516 1.0309324
SRR1039508 1.0554452
counts_scaled %>%
+ select(feature, sample, counts, counts_scaled) %>%
+ mutate( log_scl = log10(counts_scaled + 4)) %>%
+ filter( feature %in% c("ENSG00000000003", "ENSG00000000419", "ENSG00000000457", "ENSG00000000460", "ENSG00000000971", "ENSG00000001036")) %>%
+ select(-c(counts, counts_scaled)) %>%
+ pivot_wider(names_from = sample, values_from = log_scl)
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
# A tibble: 6 x 9
feature SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000000003 2.98 2.86 3.03 2.94 3.14 3.03 3.09 2.94
2 ENSG00000000419 2.82 2.92 2.89 2.89 2.86 2.92 2.82 2.89
3 ENSG00000000457 2.57 2.53 2.52 2.55 2.48 2.54 2.57 2.55
4 ENSG00000000460 1.95 1.96 1.73 1.90 2.00 1.84 2.09 1.98
5 ENSG00000000971 3.66 3.77 3.88 3.96 3.92 4.05 3.91 4.08
6 ENSG00000001036 3.31 3.23 3.33 3.28 3.24 3.17 3.33 3.23
lcpm_n %>% head
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
ENSG00000000003 4.968465 4.551329 5.1257352 4.832653 5.501732 5.124204 5.298663 4.831056
ENSG00000000419 4.430383 4.751672 4.6358756 4.672524 4.549387 4.735381 4.416679 4.660451
ENSG00000000457 3.590323 3.471364 3.4035325 3.524635 3.296633 3.470985 3.581820 3.517297
ENSG00000000460 1.510814 1.564892 0.7542605 1.337909 1.673791 1.127202 1.988089 1.616999
ENSG00000000971 7.224535 7.584134 7.9453477 8.209972 8.061492 8.517492 8.044721 8.631768
ENSG00000001036 6.043806 5.793293 6.1130816 5.940744 5.824597 5.581981 6.116828 5.783987
from tidybulk.
Hello @ConYel
I think it depends on the reference sample that tidybulk and your other code are choosing.
reference_sample | A character string. The name of the reference sample. If NULL the sample with highest total read count will be selected as reference.
library(tidybulk)
data("se_mini")
# Just getting a tibble out of a SE for testing purposes
input_df = se_mini %>% tidybulk() %>% as_tibble() %>% setNames(c("b","a", "c", "Cell type", "time" , "condition"))
input_df %>%
identify_abundant(a, b, c) %>%
scale_abundance(
.sample = a,
.transcript = b,
.abundance = c,
reference_sample = "SRR1740035",
action = "only"
)
# A tibble: 5 x 3
a TMM multiplier
<fct> <dbl> <dbl>
1 SRR1740034 0.871 1.30
2 SRR1740035 0.849 1.18
3 SRR1740043 0.843 2.70
4 SRR1740058 1.13 0.970
5 SRR1740067 1.42 1.83
Now, from input_df
try to use your edgeR code to get the norm.factors
choosing the same reference sample.
input_df %>%
rename(counts = c) %>%
as_SummarizedExperiment(a,b,counts) %>%
keep_abundant() %>%
edgeR::calcNormFactors(refColumn=2)
$counts
SRR1740034 SRR1740035 SRR1740043 SRR1740058 SRR1740067
ACAP1 7151 8028 1470 12264 616
ACP5 2278 2666 350 489 264
ADRB2 298 368 116 153 722
AIM2 3050 3004 304 56 357
ALOX5 10064 11521 9457 118 7852
177 more rows ...
$samples
group lib.size norm.factors Cell.type time condition
SRR1740034 1 520515 0.8709091 b_cell 0 d TRUE
SRR1740035 1 589046 0.8488030 b_cell 1 d TRUE
SRR1740043 1 258771 0.8431936 monocyte 1 d FALSE
SRR1740058 1 536818 1.1309222 t_cell 0 d TRUE
SRR1740067 1 227024 1.4186009 dendritic_myeloid 1 d FALSE
The TMM factor match, we then calculate scaled counts. The final counts may vary up to a scalar depending on what reference you decide, and small details. Nonetheless, if you plot tidybulk scaled counts, with your own scaled counts (provided the same TMM) should sit at a 45' line. The rest of the analyses will provide the same outcome.
from tidybulk.
Hello @stemangiola
Thank you for the clarification!
I followed your advice and I managed to get the same norm.factors
and TMM
values using the same sample as reference.
I saw that there was reference_selection_function
which is Deprecated.
In the calcNormFactors
it is used:
If refColumn is unspecified, then the column whose count-per-million upper quartile is closest to the mean upper quartile is set as the reference library.
Did reference_selection_function
had the same option it has edgeR when refColumn is NULL
?
Is it possible to include it as a default (for NULL) or as an option, or as a warning in which it will inform which sample the scale_abundance()
is using as reference?
Thank you for your time!
from tidybulk.
If refColumn is unspecified, then the column whose count-per-million upper quartile is closest to the mean upper quartile is set as the reference library.
This would make sense. However, while for modelling does not really matter which reference you pick, for calculating scaled counts for visualisation, and other statistics, we would lose some resolution for lowly transcribed transcripts in case a sample was deeply sequenced. If we pick the most deeply sequenced as reference we would not lose any resolution, at the cost to insert some quantization random error for the shallow sequenced.
I would need a more compelling argument, but it is a good point of discussion.
or as a warning in which it will inform which sample the
scale_abundance()
is using as reference
This is less impactful. I will implement this as a message.
from tidybulk.
Thanks @ConYel for your issue. I implemented your suggestion. Feel free to reopen if necessarily.
from tidybulk.
Thank you @stemangiola for the clarification!
I totally understand your point!
My thinking was more about "backward compatibility" with the original function.
In order to also keep reproducibility between workflows, in case someone wants to move from the original to the tidy way.
from tidybulk.
Related Issues (20)
- for SE save internals within the metadata rather than attributes.
- github action mac fail HOT 1
- DEG Analysis Between Single Group with Multiple Groups HOT 6
- "Proportion of variance explained" values on reduce_dimensions() output HOT 1
- github actions give preflight error for the whole tidy stack HOT 1
- ERROR; return code from pthread_create() is 22 HOT 3
- pathway analysis using tidybulk HOT 8
- Design and contrasts for tidybulk test differential abundance function HOT 9
- Remove batch effect in test_differential_abundance
- Warning in CHECK, mistery
- test
- Improve the documentation for `tidybulk` HOT 29
- turn pseudobulk from scRNAseq with logCPM value back to a seurat object with counts value? HOT 2
- `tidybulk` Remove hard dependency for `DESeq2` HOT 12
- in `test_gene_rank` check if any feature is NA, because pivot_feature will fail
- the attribute "internals" $ tt_columns abundance_scaled, carries the environment with it inflating the RDS file size
- Prepare for upcoming Seurat v5 release
- Add the method to simplify complete confounders is they are not factor of interest
- limma:MakeContrasts integration issues HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from tidybulk.