ramhiser / datamicroarray

A collection of small-sample, high-dimensional microarray data sets to assess machine-learning algorithms and models.

R 100.00%

r cancer colon-cancer high-dimensional-data machine-learning

datamicroarray's Issues

Add Su et al. (2002) Novartis multi-tissue data set

I worked with this data set in the ClustOmit paper. See the su.r script for more information.

describe_data() deviates from datasets

chiaretti:
describe_data() says n = 111, p = 12625 and K = 2.
But when I load the dataset, I get a matrix of dimension 128 x 12625 and the class vector has 6 levels.

golub:
describe_data() says K = 3, but the class vector has only 2 levels.

shipp:
describe_data() says n = 58 and p = 6817, but data matrix has dimension 77 x 7129.
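The mismatches listed above could be checked mechanically. The following is a minimal sketch, not part of the package: `check_dims` is a hypothetical helper, and it assumes each data set is a list with a sample-by-feature matrix `x` and a class vector `y`. The example runs on a synthetic stand-in shaped like the chiaretti report rather than on the real data.

```r
# Compare reported n, p and K against what a loaded data set actually contains.
# `dataset` is assumed to be a list with a matrix `x` (rows = samples) and a
# class vector `y`. `check_dims` is a hypothetical helper, not a package function.
check_dims <- function(dataset, n, p, K) {
  actual <- c(n = nrow(dataset$x),
              p = ncol(dataset$x),
              K = nlevels(factor(dataset$y)))
  reported <- c(n = n, p = p, K = K)
  data.frame(field = names(actual),
             reported = unname(reported),
             actual = unname(actual),
             match = unname(reported == actual))
}

# Synthetic stand-in: 128 samples, 12625 features, 6 class levels.
toy <- list(x = matrix(0, nrow = 128, ncol = 12625),
            y = factor(rep(letters[1:6], length.out = 128)))

# Reported values taken from the chiaretti entry above.
res <- check_dims(toy, n = 111, p = 12625, K = 2)
print(res)
```

With the real package installed, the same function could be applied to each loaded data set and the values reported by describe_data().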

Add Adam et al. (2002) prostate cancer data set

The mass-spec prostate cancer data set is mentioned in the Levina et al. (2008) Annals of Applied Statistics paper entitled "Sparse estimation of large covariance matrices via a nested Lasso penalty"

Investigate improved storage of data sets

Currently, we store each data set in the /data folder so that they are installed when the package is installed. Given that some of the data sets are quite large, it would be desirable to load these data sets in a more dynamic manner.

For example, the data sets could be hosted on GitHub, with only a few small ones included when the package is first installed. A helper function would then download the additional data sets on demand.
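One way such a helper might look is sketched below. Everything here is an assumption for illustration: `fetch_dataset`, the `base_url` argument, the cache layout, and the convention that each `.RData` file holds a single object named after the data set. The demonstration seeds the cache locally so that no network access is needed.

```r
# Load a data set from a local cache, downloading it first if it is missing.
# `base_url` is a hypothetical location; the package does not define one.
fetch_dataset <- function(name, base_url, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(name, ".RData"))
  if (!file.exists(path)) {
    url <- paste0(base_url, "/", name, ".RData")
    download.file(url, destfile = path, mode = "wb", quiet = TRUE)
  }
  env <- new.env()
  load(path, envir = env)  # assumes the file holds one object called `name`
  get(name, envir = env)
}

# Demonstration without network access: seed the cache, then load from it.
cache <- file.path(tempdir(), "dm_cache")
dir.create(cache, showWarnings = FALSE)
toy <- list(x = matrix(1:6, nrow = 2), y = factor(c("a", "b")))
save(toy, file = file.path(cache, "toy.RData"))

loaded <- fetch_dataset("toy",
                        base_url = "https://example.org/data",  # placeholder URL
                        cache_dir = cache)
```

Caching by file name keeps repeated calls cheap, at the cost of never noticing when the hosted copy changes.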

problems with loading library

When I run library(datamicroarray) in the vim plugin Nvim-R for the first time, I get the following error:

Error in read.table(idx, sep = "\t", quote = "", stringsAsFactors = FALSE) : 
  no lines available in input

The traceback points to:

5: stop("no lines available in input")
4: read.table(idx, sep = "\t", quote = "", stringsAsFactors = FALSE)
3: GetFunDescription(packlist)
2: nvim.bol(paste0(bdir, "omnils_", p, "_", pvi), p, TRUE)
1: nvimcom:::nvim.buildomnils("datamicroarray")

Loading the library again in the same R session does not reproduce the problem.

Also reported in the Nvim-R issue tracker: jalvesaq/Nvim-R#277

Add Sorlie et al. (2001) SRBCT data set

I worked with this data set in the ClustOmit paper. See the sorlie.r script for more information.

Add Gordon et al. (2002) lung cancer data set

For more information, see the Cai et al. (2011) JASA paper entitled "A Direct Estimation Approach to Sparse Linear Discriminant Analysis"

Add Yeoh et al. (2002) St. Jude Leukemia Data Set

Examine Gravier (2010) data set

Måns Thulin from Uppsala University sent the following email to me:

I am now planning to use the Gravier (2010) data to illustrate a new method in a paper, but was wondering if perhaps some of the patients in the study have been misclassified in your R package. According to the Gravier et al. paper and your description of the data on the wiki, there should be 111 patients labelled "good" and 57 labelled "poor". However, when I import the data into R, I get the following:

summary(gravier$y)
good poor
106 62

The numbers of patients (168) and features (2,905) are correct, but there seems to be a problem with the class labels. Have 5 "good" patients been labelled as "poor" or is there in fact a misprint in the Gravier et al. paper? Any insights that you could provide regarding this would be deeply appreciated!

Add PCA and LDA plots of data sets descriptions on wiki

Plot first two dimensions of data using PCA and LDA.
Color each point according to its class.
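A rough sketch of the two plots follows, using synthetic data in place of a real data set. Because these data sets have far more features than samples, the sketch fits LDA on the leading principal components rather than on the raw features; that choice, the number of components, and the helper names `pca_scores` and `lda_scores` are assumptions of this example. A second LDA dimension exists only when there are at least three classes.

```r
library(MASS)  # provides lda(); MASS ships with standard R installations

# Scores on the first two principal components.
pca_scores <- function(x) {
  prcomp(x, center = TRUE, scale. = FALSE)$x[, 1:2, drop = FALSE]
}

# LDA scores, fit on the leading principal components because p >> n
# (an assumption of this sketch, not a method taken from the package).
lda_scores <- function(x, y, n_pcs = 5) {
  pcs <- prcomp(x, center = TRUE)$x[, seq_len(n_pcs), drop = FALSE]
  fit <- lda(pcs, grouping = y)
  predict(fit, pcs)$x  # one column per discriminant, at most K - 1
}

# Synthetic stand-in: 60 samples, 200 features, 3 classes.
set.seed(1)
y <- factor(rep(c("a", "b", "c"), each = 20))
x <- matrix(rnorm(60 * 200), nrow = 60) + as.integer(y)  # shift each class

pc <- pca_scores(x)
ld <- lda_scores(x, y)

# Color each point according to its class.
plot(pc, col = as.integer(y), pch = 19, main = "PCA")
plot(ld[, 1:2], col = as.integer(y), pch = 19, main = "LDA on 5 PCs")
```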

Add data set descriptions to github Wiki

Alon et al. (1999)
Bhattacharjee et al. (2001)
Chiaretti et al. (2004)
Golub et al. (1999)
Gordon et al. (2002)
Gravier et al. (2010)
Khan et al. (2001) - SRBCT
Oberthuer et al. (2006) - Neuroblastoma
Pomeroy et al. (2002) - CNS
P53
Shipp et al. (2002)
Singh et al. (2002)
van't Veer et al. (2002)
West et al. (2001)
Yeoh et al. (2002) - St. Jude

Update GEO/ArrayExpress scripts to use Bioconductor

Before I learned about the GEOquery and ArrayExpress packages on Bioconductor, I was downloading the data manually.

Update the scripts for each of the following data sets:

Borovecki (2005)
Christensen (2009)
Gravier (2010)

Add helper function to list all data sets

This function will be useful in simulations where we wish to apply classifiers to every data set collected. For example, if we are interested only in data sets with K = 2, we need only load these data sets.

The function should list the following:

Object name, e.g., the Golub et al. (1999) data set is located in golub
The number of classes, K
The sample size, N
The number of features, p
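The listing described above could be built along these lines. This is a sketch under assumptions: `summarize_datasets` is a hypothetical name, and each data set is taken to be a list with a sample-by-feature matrix `x` and a class vector `y`. Toy data sets stand in for the real ones.

```r
# Build a table with one row per data set: object name, K, N and p.
# `datasets` is a named list; each element is assumed to have `x` and `y`.
summarize_datasets <- function(datasets) {
  data.frame(
    name = names(datasets),
    K = vapply(datasets, function(d) nlevels(factor(d$y)), integer(1),
               USE.NAMES = FALSE),
    N = vapply(datasets, function(d) nrow(d$x), integer(1),
               USE.NAMES = FALSE),
    p = vapply(datasets, function(d) ncol(d$x), integer(1),
               USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Toy stand-ins for real data sets.
toy <- list(
  two_class   = list(x = matrix(0, 10, 50), y = factor(rep(c("a", "b"), 5))),
  three_class = list(x = matrix(0, 9, 40),  y = factor(rep(c("a", "b", "c"), 3)))
)

info <- summarize_datasets(toy)

# Keep only the names of data sets with K = 2, as in the simulation use case.
binary <- info$name[info$K == 2]
```

A simulation could then load only the data sets named in `binary`.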

Add data set descriptions to help.r

Alon et al. (1999)
Bhattacharjee et al. (2001)
Chiaretti et al. (2004)
Golub et al. (1999)
Gordon et al. (2002)
Gravier et al. (2010)
Khan et al. (2001) - SRBCT
Oberthuer et al. (2006) - Neuroblastoma
Pomeroy et al. (2002) - CNS
P53
Shipp et al. (2002)
Singh et al. (2002)
van't Veer et al. (2002)
West et al. (2001)
Yeoh et al. (2002) - St. Jude

Chin data set not exported as numeric matrix

The matrix is exported as a character matrix:

> library(datamicroarray)
> data(chin)
> chin$x[, 1]
  [1] "10.169815" "10.565664" "9.589976"  "10.324175" "9.784195"  "8.969536"
  [7] "10.973057" "11.399529" "10.798559" "9.685487"  "12.051186" "10.030907"
 [13] "10.307187" "11.320309" "10.404591" "10.833785" "9.735426"  "10.797899"
 [19] "10.627682" "10.631056" "10.335046" "9.758425"  "10.472313" "10.469456"
 [25] "10.266331" "9.789535"  "10.788861" "11.191206" "10.337112" "10.871078"
 [31] "9.896685"  "9.651166"  "10.793316" "10.475492" "9.740225"  "10.437926"
 [37] "9.941238"  "10.752173" "11.025179" "10.449146" "10.502874" "9.887005"
 [43] "10.324535" "11.92731"  "10.011643" "9.074154"  "9.650978"  "10.960044"
 [49] "11.080833" "10.730092" "10.144769" "10.258973" "11.342681" "11.20937"
 [55] "10.439279" "9.872279"  "10.067042" "10.843696" "9.799298"  "10.762967"
 [61] "11.250308" "10.739098" "10.967985" "10.139285" "10.482729" "11.012492"
 [67] "10.839745" "11.115753" "10.995832" "10.024971" "10.111507" "11.373869"
 [73] "10.818594" "11.437675" "10.709085" "11.275032" "10.537405" "10.175087"
 [79] "10.822135" "9.781922"  "9.165403"  "10.538037" "8.688913"  "10.582591"
 [85] "10.726001" "10.150915" "10.373924" "10.986752" "11.470086" "10.666458"
 [91] "10.65508"  "11.493546" "10.419414" "10.164545" "9.44763"   "10.199079"
 [97] "10.612112" "10.538597" "10.92159"  "11.112414" "9.917373"  "10.352251"
[103] "10.749506" "10.191069" "10.953824" "7.211304"  "9.702876"  "10.076364"
[109] "11.080688" "10.278594" "11.371984" "10.271792" "10.553228" "10.193828"
[115] "11.170514" "10.349621" "10.679596" "10.031797"
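Until the export is fixed, a user could convert the matrix after loading. The sketch below uses a small character matrix built from the first values shown above. `to_numeric_matrix` is a hypothetical helper, and the sketch assumes every entry parses as a number.

```r
# Convert a character matrix to a numeric one, keeping dim and dimnames.
to_numeric_matrix <- function(x) {
  out <- x
  storage.mode(out) <- "double"  # entries that do not parse become NA with a warning
  out
}

# Small stand-in built from the first four values printed above.
chr <- matrix(c("10.169815", "10.565664", "9.589976", "10.324175"), nrow = 2)
num <- to_numeric_matrix(chr)
```

For the real data this would be `chin$x <- to_numeric_matrix(chin$x)` after `data(chin)`.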

Add more data sets

This bioinformatics lab has a lot of data sets along with descriptions, data analyses, and additional information.

bug by install_github("datamicroarry","ramey")

Dear Ramhiser,

I wanted to install the datamicroarray package, but unfortunately I always get an error.

It looks like this:
install.packages("devtools") # okay
devtools::install_github("ramey/datamicroarry") # but here:

Downloading github repo ramey/datamicroarry@master
Error in download(dest, src, auth) : client error: (404) Not Found
Could you help me on this?

With kind regards,
SHASHANK K S
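The 404 is consistent with the repository path in the command: the package name is spelled datamicroarry there, and the owner is given as ramey, whereas this page lists the repository as ramhiser / datamicroarray. A sketch of the call with that path follows. `install_datamicroarray` is a hypothetical wrapper added for illustration, and it is left uncalled because it needs network access.

```r
# Repository path as it appears at the top of this page.
repo <- "ramhiser/datamicroarray"

# Hypothetical convenience wrapper: installs devtools if needed, then the package.
install_datamicroarray <- function() {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github(repo)
}

# install_datamicroarray()  # uncomment to install (needs network access)
```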

Add van't Veer et al. (2002) breast cancer data set

The data set is given in the nki object in the breastCancerNKI package on Bioconductor.

Type abstract(nki), and notice that there should be "151 had lymph-node-negative disease, and 144 had lymph-node-positive disease." However, in exprs(nki), there are 337 observations. It is unclear which of these are the 144 + 151 = 295 observations.

Finalize data set descriptions in wiki

Finalize the wiki description of each of the following data sets:

~~Alon (1999)~~
Borovecki (2005)
Burczynski (2006)
~~Chiaretti (2004)~~
Chin (2006)
Chowdary (2006)
Christensen (2009)
~~Golub (1999)~~
Gordon (2002)
~~Gravier (2010)~~
~~Khan (2001)~~
Pomeroy (2002)
~~Shipp (2002)~~
~~Singh (2002)~~
~~Sorlie (2001)~~
Su (2002)
Subramanian (2005)
Tian (2003)
West (2001)
~~Yeoh (2002)~~

Add data sets from this PLOS ONE paper

This PLOS ONE paper provides a table of 19 two-population microarray data sets. I have gathered the majority of these data sets into the datamicroarray package, but below I have listed several of the papers that I missed. (The data set number provided in Table 2 is given in square brackets.)

Hodges (2006) HD Caudate [2]
Hodges (2006) HD Cerebellum [4]
Okada (2003) Liver Cancer [10]
Beer (2002) Lung Cancer [12]
Iizuka (2003) Liver Cancer [13]
Dhanasekaran (2001) Prostate Cancer [14]
Gruvberger (2001) Breast Cancer [15]
Berchuck (2005) Ovarian Cancer [17]
Zapala (2005) Neural Tissue [18]
