bhklab / metagxdata-pipeline

MetaGxData Packages Compendium

R 99.97% Shell 0.01% TeX 0.03%

metagxdata-pipeline's Introduction

UPDATED OCTOBER 9, 2015

Gendoo et al

MetaGxData Package Compendium

####################################### VERSION CONTROL

V2.2 - Current Draft

V2.3 - Modification to gene-wise and patient-wise normalization and new datasets added

####################################### To build:

Create the tar.gz file: R CMD build MetaGx______

To install:

R CMD INSTALL MetaGx_______

To get esets in data package:

library(MetaGx_____)

source(system.file("extdata", "patientselection.config", package="MetaGx_____"))

source(system.file("extdata", "createEsetList.R", package="MetaGx______"))

########################################

MetaGxBreast

Currently manipulates data from "A Three-Gene Model to Robustly Identify Breast Cancer Molecular Subtypes" (http://compbio.dfci.harvard.edu/pubs/sbtpaper/data.zip)

Includes TCGA and METABRIC

Total number of expression sets: 39

MetaGxOvarian

Currently manipulates data from FULLVcuratedOvarianData (http://bcb.dfci.harvard.edu/ovariancancer/)

Includes TCGA

Total number of expression sets: 25

metagxdata-pipeline's Issues

METABRIC has recurrence_status, but days_to_recurrence column is NA

If this data is available, we should populate the column.

Dissimilar data in MetaGxBreast package CAL dataset

Sorry for bothering again!
I've found that the CAL dataset in the MetaGxBreast package differs from the data published at https://www.ebi.ac.uk/arrayexpress/experiments/E-TABM-158/
I noticed that the concordance indices of the gene signatures available in genefu were far from satisfactory in this dataset. After some investigation, I found a mismatch between the clinical and expression data of the original dataset and those in MetaGx.
The code below reproduces these findings.

library(MetaGxBreast)
esets2= loadBreastEsets(loadString = c("CAL","MSK"))

CAL=esets2$esets[["CAL"]]

library(ArrayExpress)
accession="E-TABM-158"
MTAB=getAE(accession,path = "/home/mguerrero/Genetic_alg/Data_sets/MTAB", type = "processed")

MTAB=list(path="/home/mguerrero/Genetic_alg/Data_sets/MTAB",
rawFiles=NULL,
rawArchive=NULL,
processedFiles="breastTumorExpression.txt",
processedArchive="E-TABM-158.processed.1.zip",
sdrf="E-TABM-158.sdrf.txt",
idf="E-TABM-158.idf.txt",
adf="A-AFFY-76.adf.txt")

MTABnames=strsplit(readLines(paste(MTAB$path,MTAB$processedFiles,sep="/"))[1],"\t")[[1]]
MTABset=read.table(paste(MTAB$path,MTAB$processedFiles,sep="/"),sep="\t",skip=2,col.names=MTABnames,row.names=1)

sdrf=read.table(paste(MTAB$path,MTAB$sdrf,sep="/"),sep="\t",header=TRUE,row.names=1,comment.char="")

#All the colnames of the CAL expression set are present in the "Array.Data.File" column of the MTAB sdrf object

sdrf$genefu.name= gsub("(?i).CEL","",paste("CAL",sdrf$Array.Data.File,sep="_"))
all(colnames(exprs(CAL) )%in% sdrf$genefu.name)
#TRUE

#Nevertheless, the MTAB expression matrix does not contain all the samples listed in the clinical metadata, and its colnames correspond to the Scan.Name column of the sdrf object.
dim(MTABset)[2]
#118
dim(sdrf)[1]
#130
all(colnames(MTABset) %in% sdrf$Scan.Name)
#TRUE

#If we map the colnames of the CAL expression set to their Scan.Name and compare them with the colnames of MTAB, they do not match completely, which would mean that the expression matrix of the CAL MetaGx dataset is mislabeled
ScanNameEset=sdrf[match(colnames(exprs(CAL)), sdrf$genefu.name),"Scan.Name"]
table(colnames(MTABset) %in% ScanNameEset)

#Finally, it is important to note that the pData of the CAL eset does not match the data in the MTAB sdrf file either

sdrf=sdrf[match(colnames(exprs(CAL)), sdrf$genefu.name),]
identical(colnames(exprs(CAL)),sdrf$genefu.name)
#TRUE

table(pData(CAL)$er, sdrf$Characteristics..EstrogenReceptorStatus.)
cor(pData(CAL)$age_at_initial_pathologic_diagnosis, as.numeric(as.character(sdrf$Characteristics..age.at.diagnosis.)),use="pairwise.complete.obs")
#-0.08980216

Hope you understand what I did!
Thanks again for all your work and effort in bringing all this data closer to the users; it has been really useful!

Best!

Martin

sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 grid parallel stats graphics grDevices utils datasets methods
[10] base

other attached packages:
[1] ArrayExpress_1.42.0 bindrcpp_0.2.2 GEOquery_2.50.5 MetaGxBreast_1.2.0
[5] ExperimentHub_1.8.0 AnnotationHub_2.14.2 impute_1.56.0 lattice_0.20-38
[9] AnnotationDbi_1.44.0 IRanges_2.16.0 S4Vectors_0.20.1 illuminaio_0.24.0
[13] genefu_2.14.0 AIMS_1.14.1 Biobase_2.42.0 BiocGenerics_0.28.0
[17] e1071_1.7-0 iC10_1.4.2 iC10TrainingData_1.3.1 pamr_1.55
[21] biomaRt_2.38.0 limma_3.38.3 mclust_5.4.2 survcomp_1.32.0
[25] prodlim_2018.04.18 gplots_3.0.1 cba_0.2-19 proxy_0.4-22
[29] doParallel_1.0.14 iterators_1.0.10 foreach_1.4.4 gpuR_2.0.0
[33] survival_2.43-3 cluster_2.0.7-1

loaded via a namespace (and not attached):
[1] amap_0.8-16 assertive.base_0.0-7 class_7.3-15
[4] XVector_0.22.0 GenomicRanges_1.34.0 base64_2.0
[7] affyio_1.52.0 assertive.sets_0.0-3 bit64_0.9-7
[10] interactiveDisplayBase_1.20.0 xml2_1.2.0 oligoClasses_1.44.0
[13] assertive.data.uk_0.0-2 codetools_0.2-16 splines_3.5.2
[16] knitr_1.21 SuppDists_1.1-9.4 assertive_0.3-5
[19] assertive.data.us_0.0-2 shiny_1.2.0 BiocManager_1.30.4
[22] readr_1.3.1 compiler_3.5.2 httr_1.4.0
[25] assertthat_0.2.0 Matrix_1.2-15 later_0.7.5
[28] htmltools_0.3.6 prettyunits_1.0.2 tools_3.5.2
[31] GenomeInfoDbData_1.2.0 glue_1.3.0 affxparser_1.54.0
[34] dplyr_0.7.8 Rcpp_1.0.0 Biostrings_2.50.2
[37] preprocessCore_1.44.0 gdata_2.18.0 assertive.files_0.0-2
[40] assertive.datetimes_0.0-2 assertive.models_0.0-2 xfun_0.4
[43] stringr_1.3.1 mime_0.6 gtools_3.8.1
[46] XML_3.98-1.16 zlibbioc_1.28.0 hms_0.4.2
[49] promises_1.0.1 SummarizedExperiment_1.12.0 assertive.matrices_0.0-2
[52] assertive.strings_0.0-3 oligo_1.46.0 curl_3.2
[55] yaml_2.2.0 memoise_1.1.0 stringi_1.2.4
[58] RSQLite_2.1.1 rmeta_3.0 caTools_1.17.1.1
[61] BiocParallel_1.16.5 lava_1.6.4 GenomeInfoDb_1.18.1
[64] matrixStats_0.54.0 rlang_0.3.1 pkgconfig_2.0.2
[67] bitops_1.0-6 assertive.data_0.0-3 purrr_0.2.5
[70] bindr_0.1.1 assertive.properties_0.0-4 survivalROC_1.0.3
[73] bit_1.1-14 tidyselect_0.2.5 assertive.code_0.0-3
[76] magrittr_1.5 R6_2.3.0 bootstrap_2017.2
[79] DelayedArray_0.8.0 DBI_1.0.0 pillar_1.3.1
[82] assertive.numbers_0.0-2 RCurl_1.95-4.11 tibble_2.0.0
[85] crayon_1.3.4 assertive.types_0.0-3 KernSmooth_2.23-15
[88] progress_1.2.0 blob_1.1.1 digest_0.6.18
[91] xtable_1.8-3 ff_2.2-14 tidyr_0.8.2
[94] httpuv_1.4.5.1 openssl_1.1 assertive.reflection_0.0-4

IRB patients (GSM values) come from a different GSE

For IRB, both the experimentData and the contents of datasetsAll.xlsx list GSE6532, but pData() contains sample names such as GSM124994, which are from GSE5460.

experimentData(esets$IRB)
Experiment data
Experimenter name:
Laboratory:
Contact information: http://www-ncbi-nlm-nih-gov.proxy.wexler.hunter.cuny.edu/pubmed/?term=18498629
Title:
URL: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6532
PMIDs: 18498629
No abstract available.
notes:
summary:
A gene classifier was identified as a predictor of clinical outcome in tam
oxifen-treated breast cancer.
version:
2015-04-27 19:13:07
mapping.method:
maxRowVariance
mapping.group:
EntrezGene.ID

But if you look at pData(esets$IRB), the GSM names are from GSE5460.

GSE19829: probeset mapping problem

This series contains two platforms: HGU95v2 and HGU133Plus2. Only 254 genes are found in common, which is wrong. It must be a bug in the probeset-gene mapping function, as the two platforms must be processed separately.
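A minimal sketch of what "processed separately" could look like: each platform is collapsed from probesets to genes on its own, and only then are the gene lists intersected. This is not the pipeline's actual mapping function. The helper name collapse_by_max_variance and the toy matrices are hypothetical, and the max-variance rule is an assumption borrowed from the mapping.method value (maxRowVariance) shown in the experimentData output elsewhere on this page.

```r
# Keep, for each gene, the probeset with the largest variance across samples.
collapse_by_max_variance <- function(expr, gene_ids) {
  v <- apply(expr, 1, var)
  best <- tapply(seq_along(gene_ids), gene_ids,
                 function(i) i[which.max(v[i])])
  out <- expr[as.integer(best), , drop = FALSE]
  rownames(out) <- names(best)
  out
}

# Toy matrices standing in for the two platforms (made-up values).
exprA <- matrix(c(1, 2, 3,
                  1, 5, 9,
                  2, 2, 3), nrow = 3, byrow = TRUE,
                dimnames = list(c("a1", "a2", "a3"), c("s1", "s2", "s3")))
genesA <- c("G1", "G1", "G2")

exprB <- matrix(c(4, 6,
                  0, 1), nrow = 2, byrow = TRUE,
                dimnames = list(c("b1", "b2"), c("t1", "t2")))
genesB <- c("G2", "G3")

# Map each platform on its own, then intersect at the gene level.
mappedA <- collapse_by_max_variance(exprA, genesA)
mappedB <- collapse_by_max_variance(exprB, genesB)
common_genes <- intersect(rownames(mappedA), rownames(mappedB))
```

Running the mapping once on the pooled probesets of both arrays would mix two unrelated probeset namespaces, which is one plausible way to end up with an implausibly small overlap.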

MetaGxOvarian, make EntrezGene.ID in fData numeric (instead of factor)
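
One thing to watch when making this change: calling as.numeric() directly on a factor returns the internal level codes, not the IDs. A small sketch with a toy data frame standing in for fData(eset) (the probeset names are made up and the three IDs are only examples):

```r
# Toy stand-in for fData(eset).
fd <- data.frame(
  probeset = c("p1", "p2", "p3"),
  EntrezGene.ID = factor(c("7157", "672", "2064"))
)

# Pitfall: this yields the factor's level codes, not the gene IDs.
codes <- as.numeric(fd$EntrezGene.ID)

# Going through character recovers the actual IDs as numbers.
fd$EntrezGene.ID <- as.numeric(as.character(fd$EntrezGene.ID))
```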

Handling of deceased patients before metastasis/recurrence event

The datasets NKI, CAL, UCSF have cases labelled as dmfs_status == living_norecurrence and also vital_status == deceased.

The datasets NKI, STNO2, CAL, UCSF, UNC4, PNC have cases labelled as recurrence_status == living_norecurrence and also vital_status == deceased.

These are cases in which the patient died before distant metastasis or recurrence. We should handle this consistently across datasets. Should such a case be "event positive" (i.e. the event is defined as "metastasis or death") or "event negative" (i.e. the event is defined as "metastasis" or "recurrence", and deceased patients are treated as censored, as if they were lost to follow-up)?
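
To make the two options concrete, here is a sketch on a toy data frame standing in for pData(eset). The column names follow the ones mentioned above; the "recurrence" label and the new event_* column names are assumptions for illustration.

```r
# Toy stand-in for pData(eset); values are illustrative only.
pd <- data.frame(
  recurrence_status = c("recurrence", "living_norecurrence",
                        "living_norecurrence"),
  vital_status      = c("deceased", "deceased", "living"),
  stringsAsFactors  = FALSE
)

recurred <- pd$recurrence_status == "recurrence"
died     <- pd$vital_status == "deceased"

# Option A: composite endpoint, event = recurrence OR death.
pd$event_composite <- as.numeric(recurred | died)

# Option B: recurrence-only endpoint; death without recurrence is censored.
pd$event_recurrence_only <- as.numeric(recurred)
```

The second row is the case in question: it counts as an event under option A and as censored under option B.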

In MetaGxBreast, GSE25066 should have a pData column with NA days_to_death

Currently, the days_to_death column does not exist.

Dissimilar data in MetaGxBreast package GSE58644

Thanks for the great MetaGxBreast package you uploaded to Bioconductor! It's been really useful.
I was working with the "GSE58644" dataset when I realized that the dmfs_days values were far outside the expected range.
After some checks, I found that neither dmfs_days nor dmfs_status agrees with the original dataset uploaded to GEO. Below is a small piece of code to reproduce these findings.

library(GEOquery)
gds <- getGEO("GSE58644")
gds <- gds[[1]]
original_time= as.numeric(pData(gds)$"time:ch1")*30.41 #(original data is in months and MetaGxBreast values are in days)

original_status= as.numeric(pData(gds)$"event:ch1")

library(MetaGxBreast)

esets2= loadBreastEsets(loadString = c("GSE58644","MSK"))
MetaGx_time= pData(esets2$esets[["GSE58644"]])$dmfs_days
MetaGx_status= pData(esets2$esets[["GSE58644"]])$dmfs_status

identical(rownames(pData(esets2$esets[[1]])),rownames(pData(gds))) #TRUE, Patients have same name and order

table(original_status,MetaGx_status) #Not concordant

plot(original_time,MetaGx_time)
#Times correlate but are on different scales
#It seems that the original value was multiplied by 365.25 instead of 30.41, which would be the correct conversion
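
For reference, a small sketch of the arithmetic behind that hypothesis, using made-up follow-up times rather than the GSE58644 values. The average month length is 365.25 / 12 = 30.4375 days (30.41 above is a close approximation), so a months-to-days conversion should scale by roughly 30.4, not 365.25.

```r
# Illustrative follow-up times in months (not from GSE58644).
months <- c(12, 24, 60)

# Average month length in days.
days_per_month <- 365.25 / 12

days_correct <- months * days_per_month
days_suspect <- months * 365.25   # the scaling hypothesized in this issue

# The per-sample ratio to the original months exposes the factor applied.
ratio <- days_suspect / months
```

On the real data, the analogous check would be the ratio of MetaGx_time to the original time in months; a ratio close to 365.25 would support the hypothesis.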

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 grid parallel stats graphics grDevices utils datasets
[9] methods base

other attached packages:
[1] breastCancerTRANSBIG_1.20.0 genefu_2.14.0
[3] AIMS_1.14.0 e1071_1.7-0
[5] iC10_1.4.2 iC10TrainingData_1.3.1
[7] pamr_1.55 biomaRt_2.38.0
[9] mclust_5.4.2 survcomp_1.32.0
[11] prodlim_2018.04.18 sva_3.30.0
[13] BiocParallel_1.16.2 a4Base_1.30.0
[15] a4Core_1.30.0 a4Preproc_1.30.0
[17] glmnet_2.0-16 Matrix_1.2-15
[19] multtest_2.38.0 limma_3.38.2
[21] genefilter_1.64.0 mpm_1.0-22
[23] KernSmooth_2.23-15 MASS_7.3-51.1
[25] annaffy_1.54.0 KEGG.db_3.2.3
[27] GO.db_3.7.0 AnnotationDbi_1.44.0
[29] IRanges_2.16.0 S4Vectors_0.20.1
[31] MetaGxBreast_1.2.0 ExperimentHub_1.8.0
[33] AnnotationHub_2.14.1 impute_1.56.0
[35] BiocInstaller_1.30.0 curatedCRCData_2.14.0
[37] bindrcpp_0.2.2 GEOquery_2.50.0
[39] caret_6.0-81 ggplot2_3.1.0
[41] lattice_0.20-38 Biobase_2.42.0
[43] BiocGenerics_0.28.0 matchingR_1.3.0
[45] Rcpp_1.0.0 gpuR_2.0.0
[47] nsga2R_1.0 mco_1.0-15.1
[49] gplots_3.0.1 cba_0.2-19
[51] proxy_0.4-22 doParallel_1.0.14
[53] iterators_1.0.10 foreach_1.4.4
[55] mgcv_1.8-25 nlme_3.1-137
[57] survival_2.43-1 cluster_2.0.7-1

loaded via a namespace (and not attached):
[1] plyr_1.8.4 assertive.files_0.0-2
[3] lazyeval_0.2.1 splines_3.5.1
[5] amap_0.8-16 SuppDists_1.1-9.4
[7] digest_0.6.18 htmltools_0.3.6
[9] gdata_2.18.0 magrittr_1.5
[11] memoise_1.1.0 assertive.datetimes_0.0-2
[13] assertive.numbers_0.0-2 recipes_0.1.4
[15] readr_1.1.1 annotate_1.60.0
[17] gower_0.1.2 matrixStats_0.54.0
[19] prettyunits_1.0.2 colorspace_1.3-2
[21] blob_1.1.1 assertive.strings_0.0-3
[23] dplyr_0.7.8 crayon_1.3.4
[25] RCurl_1.95-4.11 bindr_0.1.1
[27] glue_1.3.0 gtable_0.2.0
[29] ipred_0.9-8 scales_1.0.0
[31] DBI_1.0.0 assertive.data.uk_0.0-2
[33] assertive.models_0.0-2 assertive.code_0.0-3
[35] progress_1.2.0 xtable_1.8-3
[37] bit_1.1-14 assertive.data.us_0.0-2
[39] lava_1.6.3 httr_1.3.1
[41] pkgconfig_2.0.2 XML_3.98-1.16
[43] nnet_7.3-12 tidyselect_0.2.5
[45] rlang_0.3.0.1 reshape2_1.4.3
[47] later_0.7.5 munsell_0.5.0
[49] tools_3.5.1 generics_0.0.1
[51] RSQLite_2.1.1 assertive.reflection_0.0-4
[53] stringr_1.3.1 yaml_2.2.0
[55] bootstrap_2017.2 ModelMetrics_1.2.2
[57] knitr_1.20 bit64_0.9-7
[59] assertive.matrices_0.0-2 caTools_1.17.1.1
[61] purrr_0.2.5 assertive.sets_0.0-3
[63] mime_0.6 xml2_1.2.0
[65] compiler_3.5.1 curl_3.2
[67] interactiveDisplayBase_1.20.0 tibble_1.4.2
[69] stringi_1.2.4 assertive.base_0.0-7
[71] survivalROC_1.0.3 assertive.data_0.0-1
[73] pillar_1.3.0 BiocManager_1.30.4
[75] data.table_1.11.8 bitops_1.0-6
[77] httpuv_1.4.5 assertive.types_0.0-3
[79] R6_2.3.0 assertive.properties_0.0-4
[81] promises_1.0.1 codetools_0.2-15
[83] gtools_3.8.1 assertthat_0.2.0
[85] withr_2.1.2 hms_0.4.2
[87] rpart_4.1-13 timeDate_3043.102
[89] tidyr_0.8.2 class_7.3-14
[91] shiny_1.2.0 lubridate_1.7.4
[93] rmeta_3.0 assertive_0.3-5

survival events must be 0 or 1

Currently, the survival events are coded as death/alive or recurrence/norecurrence.

This is REALLY annoying; these survival events must be replaced by 0 (censoring) or 1 (event) as numeric values (not as factors).
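
A minimal sketch of such a recoding in base R. The helper name recode_event is hypothetical, and the "deceased"/"living" labels are an assumption; substitute whatever labels a given dataset actually uses.

```r
# Map string labels to numeric 0/1; anything else (including NA) stays NA.
recode_event <- function(x, event_label, censor_label) {
  x <- as.character(x)
  out <- rep(NA_real_, length(x))
  out[x %in% event_label] <- 1
  out[x %in% censor_label] <- 0
  out
}

vital <- factor(c("deceased", "living", NA, "deceased"))
os_event <- recode_event(vital, "deceased", "living")
```

Converting through as.character() first avoids picking up the factor's internal level codes.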

SUPERTAM_HGU133A should be removed (for now) then split into separate ExpressionSets

SUPERTAM_HGU133A is composed of multiple studies, which should be separate ExpressionSets

Datasets have incorrect ExperimentData

For many datasets, the PMID and GEO accession in experimentData() differ from the annotation in datasetsALL.xlsx.

It is critical that, for each dataset name (e.g. MSK, METABRIC, MUG), three annotations are consistent: (1) the PMID and Accession in datasetsALL.xlsx (on the Google Drive), (2) the PMID and Accession in experimentData, and (3) the pData and expression values in the eset object. The cases below have been found to differ between (1) and (2). It is possible that these errors are due to an off-by-one error in experimentData(). However, since we have found at least one case in which (1) and (2) are consistent but (3) is incorrect (see issue #5), this may warrant a full audit of the data.

Perhaps we can construct each ExpressionSet to include GSM patient IDs whenever available - this will make it easier to cross-compare annotations.

Case 1:
The table in datasetsAll.xlsx gives:
Dataset: MSK, PMID: 16049480, Accession: GSE2603
But the experimentData values give a PMID of 18592372 and GSE10510

Case 2:
From the table:
MUG 18592372 GSE10510
The experimentData values give: PMIDs: 18636107, GSE5364

Case 3:
From the table:
NCCS 18636107 GSE5364
experimentData gives PMIDs: 12917485, no GSE

Case 4:
From the table:
NCI 12917485
experimentData gives PMIDs: 12490681, 11823860

Case 5:
From the table:
NKI 12490681, 11823860 Accession: Rosetta Inpharmatics
experimentData gives PMIDs: 21910250, GSE20711

Case 6:
From the table:
PNC 21910250 GSE20711
experimentData gives PMIDs: 16280042, GSE1456

Case 7:
From the table:
Dataset: STK, PMID: 16280042, Accession: GSE1456
The experimentData values give a PMID of 12829800

Cases 8-18:
The following datasets have mislabelled PMIDs, which appear to be shifted by one or two rows on the spreadsheet.
STNO2 SUPERTAM_HGU133A SUPERTAM_HGU133PLUS2 TRANSBIG UCSF UNC4 UNT UPP VDX METABRIC TCGA
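
The off-by-one hypothesis can be checked mechanically. The sketch below uses only the PMIDs transcribed from Cases 1-7, in the order the cases are listed; the data frame and its column names are hypothetical, and a real audit would read the spreadsheet and call experimentData() on each eset instead.

```r
# PMIDs from Cases 1-7: spreadsheet value vs. experimentData value.
audit <- data.frame(
  dataset    = c("MSK", "MUG", "NCCS", "NCI", "NKI", "PNC", "STK"),
  table_pmid = c("16049480", "18592372", "18636107", "12917485",
                 "12490681, 11823860", "21910250", "16280042"),
  eset_pmid  = c("18592372", "18636107", "12917485", "12490681, 11823860",
                 "21910250", "16280042", "12829800"),
  stringsAsFactors = FALSE
)

# Direct comparison: does experimentData agree with the spreadsheet?
audit$consistent <- audit$table_pmid == audit$eset_pmid

# Shift test: does each eset PMID equal the spreadsheet PMID of the next row?
n <- nrow(audit)
shifted_by_one <- audit$eset_pmid[-n] == audit$table_pmid[-1]
```

For these seven cases, no row is consistent and every experimentData PMID equals the spreadsheet PMID of the following case, which is what a one-row shift would produce.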

In METABRIC, curate disease-specific survival

Note that METABRIC has three survival states: living, deceased (disease-related), deceased (disease-unrelated)
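
A sketch of how the three states could map onto overall and disease-specific survival events. The status labels are hypothetical placeholders, not the strings used in METABRIC, and treating disease-unrelated deaths as censored is one common convention rather than a decision made here.

```r
# Hypothetical labels for the three states described above.
status <- c("living", "deceased_disease_related", "deceased_disease_unrelated")

# Overall survival: any death is an event.
os_event <- as.numeric(status != "living")

# Disease-specific survival: only disease-related death is an event;
# disease-unrelated deaths are treated as censored.
dss_event <- as.numeric(status == "deceased_disease_related")
```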

Process and integrate GSE74821 (1000+ BC patients on Prosigna) #1

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE74821

MUG has missing age, overall survival, disease-free survival, etc

For example, _IDC_A001 has NA values for almost all fields (also in /Users/Natchar/Desktop/MetaGxData/MetaGxBreast/curation/breast/uncurated/MUG.csv), but on GEO we have:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM265557
Alter = 72, Year_of_Diagnosis = 1990, Gender = f, pT = 1C, pN = 1BI, pM = X, Number_of_asported_lymphnodes = NA, Number_of_positive_lyphnodes = NA, Level_Estrogen_receptor_IHC = 3, Level_Progesteron_receptor_IHC = 3, Diagnosis = IDC, Lymphocyte_infiltration = NA, Status = NA, Reason_of_death = NA, DFS_Months = 111, OS_Months = 111, Epithel_Percentage = NA, Menopausal_status_at_First_Tumor_Diagnosis = NA, Surgical_method = MRM,_axill._Lymphadenektomie, Her2neu_DAKO = NA, Neoadjuvant_PCT_with_Anthracyclin = NA, Neoadjuvant_PCT_without_Anthracyclin = NA, Postoperative_Radiation = yes, Postop_adjuvant_Hormontherapy = NA, Adjuvant_PCT_with_Anthracyclin = NA, Adjuvant_PCT_without_Anthracyclin = NA