
metagxdata-pipeline's Introduction

UPDATED OCTOBER 9, 2015

Gendoo et al

MetaGxData Package Compendium

####################################### VERSION CONTROL

V2.2 - Current Draft

V2.3 - Modification to gene-wise and patient-wise normalization and new datasets added

####################################### To build :

Create tar.gz file: R CMD build MetaGx______

To install:

R CMD INSTALL MetaGx_______

To get esets in data package:

library(MetaGx_____)

source(system.file("extdata", "patientselection.config", package="MetaGx_____"))

source(system.file("extdata", "createEsetList.R", package="MetaGx______"))

########################################

MetaGxBreast

Currently manipulates data from "A Three-Gene Model to Robustly Identify Breast Cancer Molecular Subtypes" (http://compbio.dfci.harvard.edu/pubs/sbtpaper/data.zip)

Includes TCGA and METABRIC

Total number of expression sets: 39

MetaGxOvarian

Currently manipulates data from FULLVcuratedOvarianData (http://bcb.dfci.harvard.edu/ovariancancer/)

Includes TCGA

Total number of expression sets: 25

metagxdata-pipeline's People

Contributors

gmchen, dgendoo, natchar, bhaibeka, mzon7

Watchers

Levi Waldron, Aedin Culhane, James Cloos

Forkers

zhangyupisa

metagxdata-pipeline's Issues

Dissimilar data in MetaGxBreast package CAL dataset

Sorry for bothering again!
I've identified discrepancies between the CAL dataset of the MetaGxBreast package and the data published at https://www.ebi.ac.uk/arrayexpress/experiments/E-TABM-158/
I noticed that the concordance index of the gene signatures available in genefu was far from satisfactory in this dataset. After some playing around, I found a mismatch between the clinical and expression data of the original dataset and those in MetaGx.
Below is some code to reproduce these findings.

library(MetaGxBreast)
esets2= loadBreastEsets(loadString = c("CAL","MSK"))

CAL=esets2$esets[["CAL"]]

library(ArrayExpress)
accession="E-TABM-158"
MTAB=getAE(accession,path = "/home/mguerrero/Genetic_alg/Data_sets/MTAB", type = "processed")

MTAB=list(path="/home/mguerrero/Genetic_alg/Data_sets/MTAB",
rawFiles=NULL,
rawArchive=NULL,
processedFiles="breastTumorExpression.txt",
processedArchive="E-TABM-158.processed.1.zip",
sdrf="E-TABM-158.sdrf.txt",
idf="E-TABM-158.idf.txt",
adf="A-AFFY-76.adf.txt")

MTABnames=strsplit(readLines(paste(MTAB$path,MTAB$processedFiles,sep="/"))[1],"\t")[[1]]
MTABset=read.table(paste(MTAB$path,MTAB$processedFiles,sep="/"),sep="\t",skip=2,col.names=MTABnames,row.names=1)

sdrf=read.table(paste(MTAB$path,MTAB$sdrf,sep="/"),sep="\t",header=TRUE,row.names=1,comment.char="")

#If we check, all the colnames of the expression set CAL are present in the "Array.Data.File" column in the MTAB sdrf object

sdrf$genefu.name= gsub("(?i).CEL","",paste("CAL",sdrf$Array.Data.File,sep="_"))
all(colnames(exprs(CAL) )%in% sdrf$genefu.name)
#TRUE

#Nevertheless, the MTAB expression matrix does not have all the samples available in the clinical metadata, and its colnames correspond to the Scan.Name column in the sdrf object.
dim(MTABset)[2]
#118
dim(sdrf)[1]
#130
all(colnames(MTABset) %in% sdrf$Scan.Name)
#TRUE

#If we compare the Scan.Name values corresponding to the colnames of the CAL expression set with the colnames from MTAB, they do not match completely, which would mean that the expression matrix of the CAL MetaGx dataset is mislabeled
ScanNameEset=sdrf[match(colnames(exprs(CAL)), sdrf$genefu.name),"Scan.Name"]
table(colnames(MTABset) %in% ScanNameEset)

#Finally, it is important to note that pData from the CAL eset does not match the data in the MTAB sdrf file either

sdrf=sdrf[match(colnames(exprs(CAL)), sdrf$genefu.name),]
identical(colnames(exprs(CAL)),sdrf$genefu.name)
#TRUE

table(pData(CAL)$er, sdrf$Characteristics..EstrogenReceptorStatus.)
cor(pData(CAL)$age_at_initial_pathologic_diagnosis, as.numeric(as.character(sdrf$Characteristics..age.at.diagnosis.)),use="pairwise.complete.obs")
#-0.08980216

Hope you understand what I did!
thanks again for all your work and effort in bringing all this data closer to the users, it has been really useful!

Best!

Martin

sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 grid parallel stats graphics grDevices utils datasets methods
[10] base

other attached packages:
[1] ArrayExpress_1.42.0 bindrcpp_0.2.2 GEOquery_2.50.5 MetaGxBreast_1.2.0
[5] ExperimentHub_1.8.0 AnnotationHub_2.14.2 impute_1.56.0 lattice_0.20-38
[9] AnnotationDbi_1.44.0 IRanges_2.16.0 S4Vectors_0.20.1 illuminaio_0.24.0
[13] genefu_2.14.0 AIMS_1.14.1 Biobase_2.42.0 BiocGenerics_0.28.0
[17] e1071_1.7-0 iC10_1.4.2 iC10TrainingData_1.3.1 pamr_1.55
[21] biomaRt_2.38.0 limma_3.38.3 mclust_5.4.2 survcomp_1.32.0
[25] prodlim_2018.04.18 gplots_3.0.1 cba_0.2-19 proxy_0.4-22
[29] doParallel_1.0.14 iterators_1.0.10 foreach_1.4.4 gpuR_2.0.0
[33] survival_2.43-3 cluster_2.0.7-1

loaded via a namespace (and not attached):
[1] amap_0.8-16 assertive.base_0.0-7 class_7.3-15
[4] XVector_0.22.0 GenomicRanges_1.34.0 base64_2.0
[7] affyio_1.52.0 assertive.sets_0.0-3 bit64_0.9-7
[10] interactiveDisplayBase_1.20.0 xml2_1.2.0 oligoClasses_1.44.0
[13] assertive.data.uk_0.0-2 codetools_0.2-16 splines_3.5.2
[16] knitr_1.21 SuppDists_1.1-9.4 assertive_0.3-5
[19] assertive.data.us_0.0-2 shiny_1.2.0 BiocManager_1.30.4
[22] readr_1.3.1 compiler_3.5.2 httr_1.4.0
[25] assertthat_0.2.0 Matrix_1.2-15 later_0.7.5
[28] htmltools_0.3.6 prettyunits_1.0.2 tools_3.5.2
[31] GenomeInfoDbData_1.2.0 glue_1.3.0 affxparser_1.54.0
[34] dplyr_0.7.8 Rcpp_1.0.0 Biostrings_2.50.2
[37] preprocessCore_1.44.0 gdata_2.18.0 assertive.files_0.0-2
[40] assertive.datetimes_0.0-2 assertive.models_0.0-2 xfun_0.4
[43] stringr_1.3.1 mime_0.6 gtools_3.8.1
[46] XML_3.98-1.16 zlibbioc_1.28.0 hms_0.4.2
[49] promises_1.0.1 SummarizedExperiment_1.12.0 assertive.matrices_0.0-2
[52] assertive.strings_0.0-3 oligo_1.46.0 curl_3.2
[55] yaml_2.2.0 memoise_1.1.0 stringi_1.2.4
[58] RSQLite_2.1.1 rmeta_3.0 caTools_1.17.1.1
[61] BiocParallel_1.16.5 lava_1.6.4 GenomeInfoDb_1.18.1
[64] matrixStats_0.54.0 rlang_0.3.1 pkgconfig_2.0.2
[67] bitops_1.0-6 assertive.data_0.0-3 purrr_0.2.5
[70] bindr_0.1.1 assertive.properties_0.0-4 survivalROC_1.0.3
[73] bit_1.1-14 tidyselect_0.2.5 assertive.code_0.0-3
[76] magrittr_1.5 R6_2.3.0 bootstrap_2017.2
[79] DelayedArray_0.8.0 DBI_1.0.0 pillar_1.3.1
[82] assertive.numbers_0.0-2 RCurl_1.95-4.11 tibble_2.0.0
[85] crayon_1.3.4 assertive.types_0.0-3 KernSmooth_2.23-15
[88] progress_1.2.0 blob_1.1.1 digest_0.6.18
[91] xtable_1.8-3 ff_2.2-14 tidyr_0.8.2
[94] httpuv_1.4.5.1 openssl_1.1 assertive.reflection_0.0-4

IRB patients (GSM values) come from a different GSE

For IRB, both experimentData and datasetsAll.xlsx list GSE6532, but pData() contains sample names such as GSM124994, which are from GSE5460.

experimentData(esets$IRB)
Experiment data
Experimenter name:
Laboratory:
Contact information: http://www-ncbi-nlm-nih-gov.proxy.wexler.hunter.cuny.edu/pubmed/?term=18498629
Title:
URL: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6532
PMIDs: 18498629
No abstract available.
notes:
summary:
A gene classifier was identified as a predictor of clinical outcome in tamoxifen-treated breast cancer.
version:
2015-04-27 19:13:07
mapping.method:
maxRowVariance
mapping.group:
EntrezGene.ID

But if you look at pData(esets$IRB), the GSM names are from GSE5460.

GSE19829: probeset mapping problem

This series contains 2 platforms: HGU95v2 and HGU133Plus2. Only 254 genes are reported in common, which is wrong. It must be a bug in the probeset-gene mapping function, as the two platforms must be processed separately.
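
One way to read the suggestion above is to map probesets to genes within each platform first, and only then compare the gene lists. The sketch below illustrates that on made-up data. `collapse_to_genes`, the probe IDs, and the gene IDs are hypothetical, and the max-variance rule is assumed only because `maxRowVariance` appears as the mapping method in the experimentData output quoted for another dataset. It is not the pipeline's actual mapping function.

```r
# Hypothetical helper: collapse a probe-level matrix to one row per gene,
# keeping the probe with the largest variance for each gene.
collapse_to_genes <- function(expr, probe2gene) {
  genes <- probe2gene[rownames(expr)]
  keep <- !is.na(genes)                      # drop probes without a gene
  expr <- expr[keep, , drop = FALSE]
  genes <- genes[keep]
  probe_var <- apply(expr, 1, var)
  best <- unlist(lapply(split(seq_along(genes), genes),
                        function(i) i[which.max(probe_var[i])]))
  out <- expr[best, , drop = FALSE]
  rownames(out) <- names(best)
  out
}

set.seed(1)
# Made-up probe-level data for two platforms with different probe IDs
expr_a <- matrix(rnorm(12), nrow = 3,
                 dimnames = list(c("a1", "a2", "a3"), paste0("S", 1:4)))
map_a <- c(a1 = "G1", a2 = "G1", a3 = "G2")
expr_b <- matrix(rnorm(9), nrow = 3,
                 dimnames = list(c("b1", "b2", "b3"), paste0("T", 1:3)))
map_b <- c(b1 = "G1", b2 = "G2", b3 = "G3")

# Map each platform on its own, then intersect at the gene level
genes_a <- collapse_to_genes(expr_a, map_a)
genes_b <- collapse_to_genes(expr_b, map_b)
common <- intersect(rownames(genes_a), rownames(genes_b))
```

Mapping a combined matrix with a single annotation would silently drop every probe that is specific to the other platform, which is one plausible way to end up with a very small gene overlap.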

Handling of deceased patients before metastasis/recurrence event

The datasets NKI, CAL, UCSF have cases labelled as dmfs_status == living_norecurrence and also vital_status == deceased.

The datasets NKI, STNO2, CAL, UCSF, UNC4, PNC have cases labelled as recurrence_status == living_norecurrence and also vital_status == deceased.

These are cases in which the patient died before distant metastasis or recurrence. We should handle this consistently across datasets: should this be "event positive" (i.e. the event is defined as "metastasis or death"), or "event negative" (i.e. the event is defined as "metastasis" (or "recurrence"), with deceased patients treated as censored, as if they were lost to follow-up)?
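
The two options can be compared side by side on a toy table. This is a minimal sketch: the rows are made up, `encode_dmfs_event` is a hypothetical helper, and the level `"recurrence"` for a positive dmfs_status is an assumption (only `living_norecurrence` and `deceased` are named above).

```r
# Toy clinical table; column names follow the issue, rows are made up.
pheno <- data.frame(
  dmfs_status  = c("recurrence", "living_norecurrence", "living_norecurrence"),
  vital_status = c("living", "deceased", "living"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: encode the event as numeric 0/1.
# death_is_event = TRUE  -> event is "metastasis or death"
# death_is_event = FALSE -> event is "metastasis" only; patients who died
#                           without metastasis are censored
encode_dmfs_event <- function(pheno, death_is_event) {
  metastasis <- pheno$dmfs_status == "recurrence"
  died <- pheno$vital_status == "deceased"
  event <- if (death_is_event) metastasis | died else metastasis
  as.numeric(event)
}

event_composite <- encode_dmfs_event(pheno, death_is_event = TRUE)   # 1 1 0
event_censored  <- encode_dmfs_event(pheno, death_is_event = FALSE)  # 1 0 0
```

Only the second patient changes between the two encodings, which is exactly the group of cases listed above.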

Dissimilar data in MetaGxBreast package GSE58644

Thanks for the great MetaGxBreast package you uploaded to Bioconductor! It's been really useful.
I was working with the "GSE58644" dataset when I realized that the dmfs_days values were far outside the expected range.
After some checks, I found that not only were the dmfs_days values not concordant with the original dataset uploaded to GEO, but dmfs_status did not agree either. Below is a short script to reproduce these findings.

library(GEOquery)
gds <- getGEO("GSE58644")
gds <- gds[[1]]
original_time= as.numeric(pData(gds)$"time:ch1")*30.41 #(original data is in months and MetaGxBreast values are in days)

original_status= as.numeric(pData(gds)$"event:ch1")

library(MetaGxBreast)

esets2= loadBreastEsets(loadString = c("GSE58644","MSK"))
MetaGx_time= pData(esets2$esets[["GSE58644"]])$dmfs_days
MetaGx_status= pData(esets2$esets[["GSE58644"]])$dmfs_status

identical(rownames(pData(esets2$esets[["GSE58644"]])),rownames(pData(gds))) #TRUE, patients have the same names and order

table(original_status,MetaGx_status) #Not concordant

plot(original_time,MetaGx_time)
#Times correlate but are on different scales
#It seems that the original value was multiplied by 365.25 instead of 30.41, which would be the correct conversion
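
The size of the suspected error can be checked with plain arithmetic. The sketch below uses made-up follow-up times and a hypothetical `months_to_days` helper. It only illustrates the conversion and does not reproduce the package's curation code.

```r
days_per_year  <- 365.25
days_per_month <- days_per_year / 12   # 30.4375, close to the 30.41 used above

# Hypothetical helper: convert follow-up time from months to days
months_to_days <- function(months) months * days_per_month

months <- c(12, 60)                        # made-up follow-up times in months
correct_days  <- months_to_days(months)    # 365.25 1826.25
inflated_days <- months * days_per_year    # months mistakenly treated as years

# Treating months as years inflates every value by the same factor
inflation_factor <- inflated_days / correct_days   # 12 12
```

A constant factor of about 12 between the two time vectors would be consistent with the explanation proposed in the comment above.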

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 grid parallel stats graphics grDevices utils datasets
[9] methods base

other attached packages:
[1] breastCancerTRANSBIG_1.20.0 genefu_2.14.0
[3] AIMS_1.14.0 e1071_1.7-0
[5] iC10_1.4.2 iC10TrainingData_1.3.1
[7] pamr_1.55 biomaRt_2.38.0
[9] mclust_5.4.2 survcomp_1.32.0
[11] prodlim_2018.04.18 sva_3.30.0
[13] BiocParallel_1.16.2 a4Base_1.30.0
[15] a4Core_1.30.0 a4Preproc_1.30.0
[17] glmnet_2.0-16 Matrix_1.2-15
[19] multtest_2.38.0 limma_3.38.2
[21] genefilter_1.64.0 mpm_1.0-22
[23] KernSmooth_2.23-15 MASS_7.3-51.1
[25] annaffy_1.54.0 KEGG.db_3.2.3
[27] GO.db_3.7.0 AnnotationDbi_1.44.0
[29] IRanges_2.16.0 S4Vectors_0.20.1
[31] MetaGxBreast_1.2.0 ExperimentHub_1.8.0
[33] AnnotationHub_2.14.1 impute_1.56.0
[35] BiocInstaller_1.30.0 curatedCRCData_2.14.0
[37] bindrcpp_0.2.2 GEOquery_2.50.0
[39] caret_6.0-81 ggplot2_3.1.0
[41] lattice_0.20-38 Biobase_2.42.0
[43] BiocGenerics_0.28.0 matchingR_1.3.0
[45] Rcpp_1.0.0 gpuR_2.0.0
[47] nsga2R_1.0 mco_1.0-15.1
[49] gplots_3.0.1 cba_0.2-19
[51] proxy_0.4-22 doParallel_1.0.14
[53] iterators_1.0.10 foreach_1.4.4
[55] mgcv_1.8-25 nlme_3.1-137
[57] survival_2.43-1 cluster_2.0.7-1

loaded via a namespace (and not attached):
[1] plyr_1.8.4 assertive.files_0.0-2
[3] lazyeval_0.2.1 splines_3.5.1
[5] amap_0.8-16 SuppDists_1.1-9.4
[7] digest_0.6.18 htmltools_0.3.6
[9] gdata_2.18.0 magrittr_1.5
[11] memoise_1.1.0 assertive.datetimes_0.0-2
[13] assertive.numbers_0.0-2 recipes_0.1.4
[15] readr_1.1.1 annotate_1.60.0
[17] gower_0.1.2 matrixStats_0.54.0
[19] prettyunits_1.0.2 colorspace_1.3-2
[21] blob_1.1.1 assertive.strings_0.0-3
[23] dplyr_0.7.8 crayon_1.3.4
[25] RCurl_1.95-4.11 bindr_0.1.1
[27] glue_1.3.0 gtable_0.2.0
[29] ipred_0.9-8 scales_1.0.0
[31] DBI_1.0.0 assertive.data.uk_0.0-2
[33] assertive.models_0.0-2 assertive.code_0.0-3
[35] progress_1.2.0 xtable_1.8-3
[37] bit_1.1-14 assertive.data.us_0.0-2
[39] lava_1.6.3 httr_1.3.1
[41] pkgconfig_2.0.2 XML_3.98-1.16
[43] nnet_7.3-12 tidyselect_0.2.5
[45] rlang_0.3.0.1 reshape2_1.4.3
[47] later_0.7.5 munsell_0.5.0
[49] tools_3.5.1 generics_0.0.1
[51] RSQLite_2.1.1 assertive.reflection_0.0-4
[53] stringr_1.3.1 yaml_2.2.0
[55] bootstrap_2017.2 ModelMetrics_1.2.2
[57] knitr_1.20 bit64_0.9-7
[59] assertive.matrices_0.0-2 caTools_1.17.1.1
[61] purrr_0.2.5 assertive.sets_0.0-3
[63] mime_0.6 xml2_1.2.0
[65] compiler_3.5.1 curl_3.2
[67] interactiveDisplayBase_1.20.0 tibble_1.4.2
[69] stringi_1.2.4 assertive.base_0.0-7
[71] survivalROC_1.0.3 assertive.data_0.0-1
[73] pillar_1.3.0 BiocManager_1.30.4
[75] data.table_1.11.8 bitops_1.0-6
[77] httpuv_1.4.5 assertive.types_0.0-3
[79] R6_2.3.0 assertive.properties_0.0-4
[81] promises_1.0.1 codetools_0.2-15
[83] gtools_3.8.1 assertthat_0.2.0
[85] withr_2.1.2 hms_0.4.2
[87] rpart_4.1-13 timeDate_3043.102
[89] tidyr_0.8.2 class_7.3-14
[91] shiny_1.2.0 lubridate_1.7.4
[93] rmeta_3.0 assertive_0.3-5

survival events must be 0 or 1

Currently, the survival events are coded as death/alive or recurrence/norecurrence.

This is REALLY annoying: these survival events must be replaced by 0 (censoring) or 1 (event), stored as numeric (not as factors).
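
A minimal sketch of such a recoding follows. `status_to_numeric` is a hypothetical helper, and the level names are assumptions based on the labels mentioned in these issues.

```r
# Hypothetical helper: map a status factor to numeric 0 (censored) / 1 (event).
status_to_numeric <- function(status, event_levels) {
  status <- as.character(status)
  out <- as.numeric(status %in% event_levels)
  out[is.na(status)] <- NA   # keep a missing status as NA instead of censoring it
  out
}

vital <- factor(c("deceased", "living", NA, "deceased"))
os_event <- status_to_numeric(vital, event_levels = "deceased")   # 1 0 NA 1
```

Calling `as.numeric()` directly on the factor would return the internal level codes (1, 2, ...) rather than 0/1, so the comparison against explicit event levels is the safer route.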

Datasets have incorrect ExperimentData

For many datasets, the PMID and GEO accession in experimentData() differ from the annotation in datasetsALL.xlsx.

It is critical that, for each dataset name (e.g. MSK, METABRIC, MUG), three annotations are consistent: (1) the PMID and Accession in datasetsALL.xlsx (on the Google Drive), (2) the PMID and Accession in experimentData, and (3) the pData and expression values in the eset object. The cases below have been discovered to differ in (1) and (2). It is possible that these errors are due to an off-by-one error in experimentData(). However, since we have discovered at least one case of (1) and (2) being consistent but (3) being incorrect (see issue #5), this may warrant a full audit of the data.

Perhaps we can construct each ExpressionSet to include GSM patient IDs whenever available - this will make it easier to cross-compare annotations.

Case 1:
The table in datasetsAll.xlsx gives:
Dataset: MSK, PMID: 16049480, Accession: GSE2603
But the experimentData values give a PMID of 18592372 and GSE10510

Case 2:
From the table:
MUG 18592372 GSE10510
The experimentData values give: PMIDs: 18636107, GSE5364

Case 3:
From the table:
NCCS 18636107 GSE5364
experimentData gives PMIDs: 12917485, no GSE

Case 4:
From the table:
NCI 12917485
experimentData gives PMIDs: 12490681, 11823860

Case 5:
From the table:
NKI 12490681, 11823860 Accession: Rosetta Inpharmatics
experimentData gives PMID 21910250, GSE20711

Case 6:
From the table:
PNC 21910250 GSE20711
experimentData gives PMIDs: 16280042, GSE1456

Case 7:
From the table:
Dataset: STK, PMID: 16280042, Accession: GSE1456
The experimentData values give a PMID of 12829800

Case 8-18:
The following datasets have mislabelled PMIDs, which appear to be shifted by one or two positions relative to the spreadsheet.
STNO2 SUPERTAM_HGU133A SUPERTAM_HGU133PLUS2 TRANSBIG UCSF UNC4 UNT UPP VDX METABRIC TCGA
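
The off-by-one hypothesis can be tested directly on the values reported in Cases 1-4. In this sketch the PMIDs are copied from those cases (for NCI, only the first of the two experimentData PMIDs is used). The data frame layout and column names are assumptions for illustration.

```r
# PMIDs as reported in Cases 1-4: spreadsheet value vs experimentData value
annot <- data.frame(
  dataset    = c("MSK", "MUG", "NCCS", "NCI"),
  table_pmid = c("16049480", "18592372", "18636107", "12917485"),
  eset_pmid  = c("18592372", "18636107", "12917485", "12490681"),
  stringsAsFactors = FALSE
)

# Direct comparison: is each dataset annotated consistently?
annot$consistent <- annot$table_pmid == annot$eset_pmid

# Shift test: does each experimentData PMID equal the spreadsheet PMID
# of the next dataset in the table?
n <- nrow(annot)
shifted_by_one <- annot$eset_pmid[-n] == annot$table_pmid[-1]
```

For these four rows no dataset is consistent, and every experimentData PMID matches the spreadsheet PMID of the following row. The same test could be run over the full spreadsheet once its contents and the experimentData values are collected into one table.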

MUG has missing age, overall survival, disease-free survival, etc

For example, _IDC_A001 has NA values for almost all fields (also in /Users/Natchar/Desktop/MetaGxData/MetaGxBreast/curation/breast/uncurated/MUG.csv), but on GEO we have:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM265557
Alter = 72, Year_of_Diagnosis = 1990, Gender = f, pT = 1C, pN = 1BI, pM = X, Number_of_asported_lymphnodes = NA, Number_of_positive_lyphnodes = NA, Level_Estrogen_receptor_IHC = 3, Level_Progesteron_receptor_IHC = 3, Diagnosis = IDC, Lymphocyte_infiltration = NA, Status = NA, Reason_of_death = NA, DFS_Months = 111, OS_Months = 111, , Epithel_Percentage = NA, Menopausal_status_at_First_Tumor_Diagnosis = NA, Surgical_method = MRM,_axill._Lymphadenektomie, Her2neu_DAKO = NA, Neoadjuvant_PCT_with_Anthracyclin = NA, Neoadjuvant_PCT_without_Anthracyclin = NA, Postoperative_Radiation = yes, Postop_adjuvant_Hormontherapy = NA, Adjuvant_PCT_with_Anthracyclin = NA, Adjuvant_PCT_without_Anthracyclin = NA

Errors calling experimentData()

experimentData(esets$UPP) gives an error
Error in if (length(object@abstract) > 0 && all(object@abstract != "")) cat("\n Abstract: A", :
missing value where TRUE/FALSE needed

Also get errors from:
experimentData(esets$VDX) gives an error.
experimentData(esets$DFHCC3) gives an error
experimentData(esets$DUKE) gives an error
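
The condition quoted in the error message fails in exactly this way when the abstract is `NA`. The sketch below reproduces that in base R, without Biobase. `abstract_is_printable` and `clean_abstract` are hypothetical helpers, and a missing abstract is only the likely cause, not a confirmed one.

```r
# Same condition as in the error message, applied to a plain character value
abstract_is_printable <- function(abstract) {
  length(abstract) > 0 && all(abstract != "")
}

ok_case <- abstract_is_printable("Some abstract text")   # TRUE
na_case <- abstract_is_printable(NA_character_)          # NA, not TRUE/FALSE

# Using NA as the condition of if() raises an error, captured here as text
err <- tryCatch(
  if (abstract_is_printable(NA_character_)) "printed" else "skipped",
  error = function(e) conditionMessage(e)
)

# Possible workaround (an assumption): replace a missing abstract with ""
clean_abstract <- function(abstract) {
  abstract[is.na(abstract)] <- ""
  abstract
}
fixed <- abstract_is_printable(clean_abstract(NA_character_))   # FALSE
```

If this is the cause, setting the abstract of the affected datasets to an empty string (or a real abstract) when the esets are built should let experimentData() print normally.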
