ramhiser / datamicroarray
A collection of small-sample, high-dimensional microarray data sets to assess machine-learning algorithms and models.
I worked with this data set in the ClustOmit paper. See the su.r script for more information.
chiaretti:
describe_data() says n = 111, p = 12625 and K = 2.
But when I load the dataset, I get a matrix of dimension 128 x 12625 and the class vector has 6 levels.
golub:
describe_data() says K = 3, but the class vector has only 2 levels.
shipp:
describe_data() says n = 58 and p = 6817, but data matrix has dimension 77 x 7129.
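Mismatches like these can be checked mechanically. The following is a minimal sketch, assuming each data set is a list with a matrix `x` (observations in rows) and a class vector `y`. The helper name `check_dims` and the toy data are hypothetical and only mimic the shape of the shipp report; the K value is made up.

```r
# Compare the dimensions claimed for a data set with what is actually loaded.
# `dataset` is assumed to be a list with a matrix `x` and a class vector `y`;
# `claimed` is a named list with elements n, p and K.
check_dims <- function(dataset, claimed) {
  actual <- list(
    n = nrow(dataset$x),
    p = ncol(dataset$x),
    K = nlevels(factor(dataset$y))  # number of classes actually present
  )
  fields <- c("n", "p", "K")
  out <- data.frame(
    field   = fields,
    claimed = unlist(claimed[fields], use.names = FALSE),
    actual  = unlist(actual[fields], use.names = FALSE),
    stringsAsFactors = FALSE
  )
  out$match <- out$claimed == out$actual
  out
}

# Toy stand-in (not the real data): 77 x 7129 loaded, 58 x 6817 claimed.
toy <- list(
  x = matrix(0, nrow = 77, ncol = 7129),
  y = factor(rep(c("a", "b"), length.out = 77))
)
report <- check_dims(toy, list(n = 58, p = 6817, K = 2))
print(report)
```

Running such a check over every data set would flag each row where the documented and loaded values disagree.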
The mass-spec prostate cancer data set is mentioned in the Levina et al. (2008) Annals of Applied Statistics paper entitled "Sparse estimation of large covariance matrices via a nested Lasso penalty".
Currently, we store each data set in the /data folder so that they are installed when the package is installed. Given that some of the data sets are quite large, it would be desirable to load these data sets in a more dynamic manner.
For example, the data sets could be stored on GitHub, with only a few select (small) data sets included when the package is first installed. A helper function would then download the additional data sets on demand.
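One way such a helper could look is sketched below. Everything here is an assumption for illustration: the function name `fetch_dataset`, the `base_url` argument, and the `<name>.RData` file layout are hypothetical and not part of the package. The downloader is passed in as an argument so the sketch can be demonstrated without network access.

```r
# Hypothetical helper: download a data set on first use and cache it locally.
# `base_url` and the "<name>.RData" layout are assumptions for this sketch.
fetch_dataset <- function(name, base_url,
                          cache_dir = file.path(tempdir(), "datamicroarray"),
                          downloader = utils::download.file) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(cache_dir, paste0(name, ".RData"))

  # Only contact the server when the file is not cached yet.
  if (!file.exists(dest)) {
    url <- paste0(base_url, "/", name, ".RData")
    status <- downloader(url, dest, mode = "wb")
    if (!identical(as.integer(status), 0L) || !file.exists(dest)) {
      stop("Download failed for data set '", name, "'")
    }
  }

  # Load into a private environment so the caller's workspace is untouched.
  env <- new.env()
  loaded <- load(dest, envir = env)
  get(loaded[1], envir = env)
}

# Offline demonstration: a stand-in downloader that writes a tiny data set
# instead of contacting a server.
fake_downloader <- function(url, destfile, mode = "wb") {
  toy <- list(x = matrix(1:6, nrow = 2), y = factor(c("a", "b")))
  save(toy, file = destfile)
  0L
}
demo_cache <- tempfile("cache")
toy <- fetch_dataset("toy", "https://example.org/data", demo_cache, fake_downloader)
print(dim(toy$x))
```

Caching by file name means a second call for the same data set reads from disk and never invokes the downloader.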
When running library(datamicroarray) in the Vim plugin Nvim-R for the first time, I get the following error:
Error in read.table(idx, sep = "\t", quote = "", stringsAsFactors = FALSE) :
no lines available in input
The traceback points to:
5: stop("no lines available in input")
4: read.table(idx, sep = "\t", quote = "", stringsAsFactors = FALSE)
3: GetFunDescription(packlist)
2: nvim.bol(paste0(bdir, "omnils_", p, "_", pvi), p, TRUE)
1: nvimcom:::nvim.buildomnils("datamicroarray")
When the library is loaded again in the same R session, the problem does not occur.
See the Nvim-R issue tracker: jalvesaq/Nvim-R#277
I worked with this data set in the ClustOmit paper. See the sorlie.r script for more information.
For more information, see the Cai et al. (2011) JASA paper entitled "A Direct Estimation Approach to Sparse Linear Discriminant Analysis".
Måns Thulin from Uppsala University sent the following email to me:
I am now planning to use the Gravier (2010) data to illustrate a new method in a paper, but was wondering if perhaps some of the patients in the study have been misclassified in your R package. According to the Gravier et al. paper and your description of the data on the wiki, there should be 111 patients labelled "good" and 57 labelled "poor". However, when I import the data into R, I get the following:
summary(gravier$y)
good poor
 106   62
The numbers of patients (168) and features (2,905) are correct, but there seems to be a problem with the class labels. Have 5 "good" patients been labelled as "poor" or is there in fact a misprint in the Gravier et al. paper? Any insights that you could provide regarding this would be deeply appreciated!
Before I learned about the GEOquery and ArrayExpress packages on Bioconductor, I was downloading the data manually.
Update the scripts for each of the following data sets:
This function will be useful in simulations where we wish to apply classifiers to every data set collected. For example, if we are interested only in data sets with K = 2, we need only load these data sets.
The function should list the following:
golub
K
N
p
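A rough sketch of such a listing function follows, assuming each data set is a list with a matrix `x` and a class vector `y`. The name `summarise_datasets` and the toy data sets are hypothetical and exist only to show the filtering step for K = 2.

```r
# Summarise a named list of data sets into one row per data set
# with the sample size N, the number of features p and the number of classes K.
summarise_datasets <- function(datasets) {
  rows <- lapply(names(datasets), function(name) {
    d <- datasets[[name]]
    data.frame(
      name = name,
      N = nrow(d$x),
      p = ncol(d$x),
      K = nlevels(factor(d$y)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy data sets, made up for illustration only.
toy_sets <- list(
  two_class   = list(x = matrix(0, 10, 50), y = rep(c("a", "b"), 5)),
  three_class = list(x = matrix(0, 12, 40), y = rep(c("a", "b", "c"), 4))
)
info <- summarise_datasets(toy_sets)
print(info)

# Keep only the names of the two-class data sets.
binary <- info$name[info$K == 2]
print(binary)
```

With a table like `info` available up front, a simulation can select the data sets it needs before loading any of the large matrices.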
The matrix is exported as a character matrix:
> library(datamicroarray)
> data(chin)
> chin$x[, 1]
[1] "10.169815" "10.565664" "9.589976" "10.324175" "9.784195" "8.969536"
[7] "10.973057" "11.399529" "10.798559" "9.685487" "12.051186" "10.030907"
[13] "10.307187" "11.320309" "10.404591" "10.833785" "9.735426" "10.797899"
[19] "10.627682" "10.631056" "10.335046" "9.758425" "10.472313" "10.469456"
[25] "10.266331" "9.789535" "10.788861" "11.191206" "10.337112" "10.871078"
[31] "9.896685" "9.651166" "10.793316" "10.475492" "9.740225" "10.437926"
[37] "9.941238" "10.752173" "11.025179" "10.449146" "10.502874" "9.887005"
[43] "10.324535" "11.92731" "10.011643" "9.074154" "9.650978" "10.960044"
[49] "11.080833" "10.730092" "10.144769" "10.258973" "11.342681" "11.20937"
[55] "10.439279" "9.872279" "10.067042" "10.843696" "9.799298" "10.762967"
[61] "11.250308" "10.739098" "10.967985" "10.139285" "10.482729" "11.012492"
[67] "10.839745" "11.115753" "10.995832" "10.024971" "10.111507" "11.373869"
[73] "10.818594" "11.437675" "10.709085" "11.275032" "10.537405" "10.175087"
[79] "10.822135" "9.781922" "9.165403" "10.538037" "8.688913" "10.582591"
[85] "10.726001" "10.150915" "10.373924" "10.986752" "11.470086" "10.666458"
[91] "10.65508" "11.493546" "10.419414" "10.164545" "9.44763" "10.199079"
[97] "10.612112" "10.538597" "10.92159" "11.112414" "9.917373" "10.352251"
[103] "10.749506" "10.191069" "10.953824" "7.211304" "9.702876" "10.076364"
[109] "11.080688" "10.278594" "11.371984" "10.271792" "10.553228" "10.193828"
[115] "11.170514" "10.349621" "10.679596" "10.031797"
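Until the stored object is fixed, the matrix can be converted after loading. The sketch below uses a small character matrix as a stand-in for chin$x, with four values copied from the output above; the row and column names are made up.

```r
# A small character matrix standing in for chin$x.
# The dimnames are hypothetical labels for this sketch.
x_chr <- matrix(
  c("10.169815", "10.565664", "9.589976", "10.324175"),
  nrow = 2,
  dimnames = list(c("s1", "s2"), c("g1", "g2"))
)

# Changing the storage mode converts the values to doubles while
# keeping the dimensions and dimnames intact.
x_num <- x_chr
storage.mode(x_num) <- "double"

print(is.numeric(x_num))
print(dim(x_num))
```

Calling as.numeric() on the matrix would also convert the values, but it drops the dim attribute and returns a plain vector, which is why the storage mode is changed instead.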
This bioinformatics lab has a lot of data sets along with descriptions, data analyses, and additional information.
Dear Ramhiser,
I wanted to install the datamicroarray package, but unfortunately I always get an error.
It looks like this:
install.packages("devtools") # okay
devtools::install_github("ramey/datamicroarry") # but here:
Downloading github repo ramey/datamicroarry@master
Error in download(dest, src, auth) : client error: (404) Not Found
Could you help me on this?
With kind regards,
SHASHANK K S
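A likely cause, judging only from the repository name shown at the top of this page: the command requests ramey/datamicroarry, whereas the repository is listed as ramhiser/datamicroarray. A sketch of the corrected call, guarded so that it only runs in an interactive session where devtools is already installed:

```r
# Repository path as listed at the top of this page.
repo <- "ramhiser/datamicroarray"

# Guarded so that sourcing this file in a batch job does not trigger an install.
if (interactive() && requireNamespace("devtools", quietly = TRUE)) {
  devtools::install_github(repo)
}
```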
The data set is given in the nki object in the breastCancerNKI package on Bioconductor.
Type abstract(nki), and notice that there should be "151 had lymph-node-negative disease, and 144 had lymph-node-positive disease." However, in exprs(nki), there are 337 observations. It is unclear which are the 144 + 151 = 295 observations.
Finalize the wiki description of each of the following data sets:
This PLOS ONE paper provides a table of 19 two-population microarray data sets. I have gathered the majority of these data sets into the datamicroarray package, but below I have listed several of the papers that I missed. (The data set number provided in Table 2 is given in square brackets.)