Dear Stefano, thank you much for this very useful package! I would like to use

Hello Stefano! I started from this one: <a href="https://stemangiola.github.io/bio

Hello <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-ur

Thanks <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-u

Thank you <a class="user-mention notranslate" data-hovercard-type="user" data-hovercar

Normalization method (scaling) about tidybulk HOT 9 CLOSED

ConYel commented on June 20, 2024 1

Normalization method (scaling)

from tidybulk.

Comments (9)

stemangiola commented on June 20, 2024

Thanks! Which workshop are you following?

Some comments

counts_tt <-
airway %>%
tidybulk() %>%

You don't need to transform to tidybulk from SummarizedExperiment necessarily.

counts_scaled <- counts_tt %>% scale_abundance(factor_of_interest = dex)

You need to identify (or keep) abundant gene transcripts, before scaling

counts_tt %>% identify_abundant(factor_of_interest = ...) %>% scale_abundance(...)

This will execute

filterByExpr groups

Let us know how you go. Do the TMM match now?

from tidybulk.

ConYel commented on June 20, 2024

Hello Stefano!
I started from this one: https://stemangiola.github.io/bioc_2020_tidytranscriptomics/articles/tidytranscriptomics.html
My goal is to see if I can use all the functions of edgeR and limma through tidybulk in order to reform
a workflow for smallRNAs.

I didn't use the identify_abundant
because I get:

counts_tt %>% identify_abundant(factor_of_interest = "dex")
Error in identify_abundant(., factor_of_interest = "dex") : 
  could not find function "identify_abundant"

My session info:

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_1.0.3                 airway_1.8.0                SummarizedExperiment_1.18.2
 [4] DelayedArray_0.14.1         matrixStats_0.58.0          Biobase_2.48.0             
 [7] GenomicRanges_1.40.0        GenomeInfoDb_1.24.2         IRanges_2.22.2             
[10] S4Vectors_0.26.1            BiocGenerics_0.34.0         tidybulk_1.0.2             

loaded via a namespace (and not attached):
 [1] pillar_1.4.7           compiler_4.0.3         RColorBrewer_1.1-2    
 [4] BiocManager_1.30.10    XVector_0.28.0         zlibbioc_1.34.0       
 [7] bitops_1.0-6           tools_4.0.3            digest_0.6.27         
[10] lattice_0.20-41        evaluate_0.14          lifecycle_0.2.0       
[13] tibble_3.0.5           preprocessCore_1.50.0  pkgconfig_2.0.3       
[16] rlang_0.4.10           Matrix_1.2-18          rstudioapi_0.13       
[19] cli_2.3.0              DBI_1.1.1              yaml_2.2.1            
[22] xfun_0.20              GenomeInfoDbData_1.2.3 knitr_1.31            
[25] generics_0.1.0         vctrs_0.3.6            hms_1.0.0             
[28] grid_4.0.3             tidyselect_1.1.0       glue_1.4.2            
[31] R6_2.5.0               fansi_0.4.2            rmarkdown_2.6         
[34] purrr_0.3.4            readr_1.4.0            tidyr_1.1.2           
[37] magrittr_2.0.1         ellipsis_0.3.1         htmltools_0.5.1.1     
[40] assertthat_0.2.1       utf8_1.1.4             RCurl_1.98-1.2        
[43] crayon_1.4.0

from tidybulk.

stemangiola commented on June 20, 2024

Hello @ConYel,

I think you are using an old version.

I suggest to

use R 4.1.x,
install the last version of tidybulk (either from Bioconductor or from github master)
look at most recent workshop
https://github.com/tidytranscriptomics-workshops

Let us know how you go.

from tidybulk.

ConYel commented on June 20, 2024

Hello @stemangiola
I updated everything as you suggested but still it doesn't give the same results.

my new session:

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tibble_3.1.5                   tidyr_1.1.4                    airway_1.14.0                 
 [4] tidySummarizedExperiment_1.4.1 SummarizedExperiment_1.24.0    Biobase_2.54.0                
 [7] GenomicRanges_1.46.0           GenomeInfoDb_1.30.0            IRanges_2.28.0                
[10] S4Vectors_0.32.0               BiocGenerics_0.40.0            MatrixGenerics_1.6.0          
[13] matrixStats_0.61.0             edgeR_3.36.0                   limma_3.50.0                  
[16] stringr_1.4.0                  tidybulk_1.6.1                 dplyr_1.0.7                   

loaded via a namespace (and not attached):
 [1] httr_1.4.2             viridisLite_0.4.0      jsonlite_1.7.2         bit64_4.0.5           
 [5] assertthat_0.2.1       BiocManager_1.30.16    blob_1.2.2             GenomeInfoDbData_1.2.7
 [9] yaml_2.2.1             pillar_1.6.4           RSQLite_2.2.8          lattice_0.20-45       
[13] glue_1.4.2             digest_0.6.28          RColorBrewer_1.1-2     XVector_0.34.0        
[17] colorspace_2.0-2       htmltools_0.5.2        preprocessCore_1.56.0  Matrix_1.3-4          
[21] pkgconfig_2.0.3        zlibbioc_1.40.0        purrr_0.3.4            scales_1.1.1          
[25] tzdb_0.2.0             KEGGREST_1.34.0        generics_0.1.1         ggplot2_3.3.5         
[29] ellipsis_0.3.2         cachem_1.0.6           lazyeval_0.2.2         cli_3.1.0             
[33] magrittr_2.0.1         crayon_1.4.2           memoise_2.0.0          evaluate_0.14         
[37] fansi_0.5.0            data.table_1.14.2      tools_4.1.2            org.Hs.eg.db_3.14.0   
[41] hms_1.1.1              lifecycle_1.0.1        plotly_4.10.0          munsell_0.5.0         
[45] locfit_1.5-9.4         DelayedArray_0.20.0    AnnotationDbi_1.56.1   Biostrings_2.62.0     
[49] compiler_4.1.2         rlang_0.4.12           grid_4.1.2             RCurl_1.98-1.5        
[53] rstudioapi_0.13        htmlwidgets_1.5.4      bitops_1.0-7           rmarkdown_2.11        
[57] gtable_0.3.0           DBI_1.1.1              R6_2.5.1               knitr_1.36            
[61] fastmap_1.1.0          bit_4.0.4              utf8_1.2.2             readr_2.0.2           
[65] stringi_1.7.5          parallel_4.1.2         Rcpp_1.0.7             vctrs_0.3.8           
[69] png_0.1-7              tidyselect_1.1.1       xfun_0.27

Results

counts_scaled %>%  group_by(sample, TMM, multiplier ) %>% summarise()
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
`summarise()` has grouped output by 'sample', 'TMM'. You can override using the `.groups` argument.
# A tibble: 8 x 3
# Groups:   sample, TMM [8]
  sample       TMM multiplier
  <chr>      <dbl>      <dbl>
1 SRR1039508 1.06        1.42
2 SRR1039509 1.02        1.60
3 SRR1039512 0.987       1.23
4 SRR1039513 0.951       2.14
5 SRR1039516 1.03        1.22
6 SRR1039517 0.975       1.03
7 SRR1039520 1.02        1.58
8 SRR1039521 0.959       1.52
Warning message:
In is_sample_feature_deprecated_used(.data, (enquos(..., .ignore_empty = "all") %>%  :
  tidySummarizedExperiment says: from version 1.3.1, the special columns including sample/feature id (colnames(se), rownames(se)) has changed to ".sample" and ".feature". This dataset is returned with the old-style vocabulary (feature and sample), however we suggest to update your workflow to reflect the new vocabulary (.feature, .sample)
  
dgList_norm$samples %>% arrange(norm.factors) %>% select(norm.factors)
           norm.factors
SRR1039513    0.9484696
SRR1039521    0.9538599
SRR1039517    0.9779832
SRR1039512    0.9904567
SRR1039509    1.0212137
SRR1039520    1.0269341
SRR1039516    1.0309324
SRR1039508    1.0554452


counts_scaled %>% 
+ select(feature, sample, counts, counts_scaled) %>% 
+ mutate( log_scl = log10(counts_scaled + 4)) %>% 
+ filter( feature   %in% c("ENSG00000000003", "ENSG00000000419", "ENSG00000000457", "ENSG00000000460", "ENSG00000000971", "ENSG00000001036")) %>%  
+ select(-c(counts, counts_scaled)) %>% 
+ pivot_wider(names_from = sample, values_from = log_scl)
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
# A tibble: 6 x 9
  feature         SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
  <chr>                <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
1 ENSG00000000003       2.98       2.86       3.03       2.94       3.14       3.03       3.09       2.94
2 ENSG00000000419       2.82       2.92       2.89       2.89       2.86       2.92       2.82       2.89
3 ENSG00000000457       2.57       2.53       2.52       2.55       2.48       2.54       2.57       2.55
4 ENSG00000000460       1.95       1.96       1.73       1.90       2.00       1.84       2.09       1.98
5 ENSG00000000971       3.66       3.77       3.88       3.96       3.92       4.05       3.91       4.08
6 ENSG00000001036       3.31       3.23       3.33       3.28       3.24       3.17       3.33       3.23


lcpm_n %>% head
                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
ENSG00000000003   4.968465   4.551329  5.1257352   4.832653   5.501732   5.124204   5.298663   4.831056
ENSG00000000419   4.430383   4.751672  4.6358756   4.672524   4.549387   4.735381   4.416679   4.660451
ENSG00000000457   3.590323   3.471364  3.4035325   3.524635   3.296633   3.470985   3.581820   3.517297
ENSG00000000460   1.510814   1.564892  0.7542605   1.337909   1.673791   1.127202   1.988089   1.616999
ENSG00000000971   7.224535   7.584134  7.9453477   8.209972   8.061492   8.517492   8.044721   8.631768
ENSG00000001036   6.043806   5.793293  6.1130816   5.940744   5.824597   5.581981   6.116828   5.783987

from tidybulk.

stemangiola commented on June 20, 2024

Hello @ConYel

I think it depends on the reference sample that tidybulk and your other code are choosing.

reference_sample | A character string. The name of the reference sample. If NULL the sample with highest total read count will be selected as reference.

library(tidybulk)
data("se_mini")
# Just getting a tibble out of a SE for testing purposes
input_df = se_mini %>% tidybulk() %>% as_tibble() %>% setNames(c("b","a",  "c", "Cell type", "time" , "condition"))

input_df %>% 
    identify_abundant(a, b, c) %>%
    scale_abundance(
        .sample = a,
        .transcript = b,
        .abundance = c, 
        reference_sample = "SRR1740035",
        action = "only"
    )

# A tibble: 5 x 3
  a            TMM multiplier
  <fct>      <dbl>      <dbl>
1 SRR1740034 0.871      1.30 
2 SRR1740035 0.849      1.18 
3 SRR1740043 0.843      2.70 
4 SRR1740058 1.13       0.970
5 SRR1740067 1.42       1.83

Now, from input_df try to use your edgeR code to get the norm.factors choosing the same reference sample.

input_df %>% 
rename(counts = c) %>% 
    as_SummarizedExperiment(a,b,counts) %>%
     keep_abundant() %>%
     edgeR::calcNormFactors(refColumn=2)

$counts
      SRR1740034 SRR1740035 SRR1740043 SRR1740058 SRR1740067
ACAP1       7151       8028       1470      12264        616
ACP5        2278       2666        350        489        264
ADRB2        298        368        116        153        722
AIM2        3050       3004        304         56        357
ALOX5      10064      11521       9457        118       7852
177 more rows ...

$samples
           group lib.size norm.factors         Cell.type time condition
SRR1740034     1   520515    0.8709091            b_cell  0 d      TRUE
SRR1740035     1   589046    0.8488030            b_cell  1 d      TRUE
SRR1740043     1   258771    0.8431936          monocyte  1 d     FALSE
SRR1740058     1   536818    1.1309222            t_cell  0 d      TRUE
SRR1740067     1   227024    1.4186009 dendritic_myeloid  1 d     FALSE

The TMM factor match, we then calculate scaled counts. The final counts may vary up to a scalar depending on what reference you decide, and small details. Nonetheless, if you plot tidybulk scaled counts, with your own scaled counts (provided the same TMM) should sit at a 45' line. The rest of the analyses will provide the same outcome.

from tidybulk.

ConYel commented on June 20, 2024

Hello @stemangiola
Thank you for the clarification!
I followed your advice and I managed to get the same norm.factors and TMM values using the same sample as reference.

I saw that there was reference_selection_function which is Deprecated.

In the calcNormFactors it is used:
If refColumn is unspecified, then the column whose count-per-million upper quartile is closest to the mean upper quartile is set as the reference library.
Did reference_selection_function had the same option it has edgeR when refColumn is NULL?

Is it possible to include it as a default (for NULL) or as an option, or as a warning in which it will inform which sample the scale_abundance() is using as reference?

Thank you for your time!

from tidybulk.

stemangiola commented on June 20, 2024

If refColumn is unspecified, then the column whose count-per-million upper quartile is closest to the mean upper quartile is set as the reference library.

This would make sense. However, while for modelling does not really matter which reference you pick, for calculating scaled counts for visualisation, and other statistics, we would lose some resolution for lowly transcribed transcripts in case a sample was deeply sequenced. If we pick the most deeply sequenced as reference we would not lose any resolution, at the cost to insert some quantization random error for the shallow sequenced.

I would need a more compelling argument, but it is a good point of discussion.

or as a warning in which it will inform which sample the scale_abundance() is using as reference

This is less impactful. I will implement this as a message.

from tidybulk.

stemangiola commented on June 20, 2024

Thanks @ConYel for your issue. I implemented your suggestion. Feel free to reopen if necessarily.

from tidybulk.

ConYel commented on June 20, 2024

Thank you @stemangiola for the clarification!

I totally understand your point!

My thinking was more about "backward compatibility" with the original function.
In order to also keep reproducibility between workflows, in case someone wants to move from the original to the tidy way.

from tidybulk.

Normalization method (scaling) about tidybulk HOT 9 CLOSED

Comments (9)

Results

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent