

mrmre's People

Contributors

ba3lwi, bhaibeka, christophereeles, clareli9, gangeshberi, kofiav, ndejay, spapillon


mrmre's Issues

R-3.3.2 Compilation Warnings

Dear maintainer,

Your package mRMRe checks with at least one WARNING or ERROR under R-release (R-3.3.2) for Windows, and likely also under other combinations of R versions and platforms; please see:
https://CRAN.R-project.org/web/checks/check_results_mRMRe.html.
Please fix all issues shown on that page, check the package yourself following the CRAN policies, and submit via the webform.

If you have received other instructions during the last few weeks, or submitted a new version during the last 24 hours, please ignore this message.
If the package is not supposed to work under Windows, or we have to change settings on winbuilder, please let us know (the response may be delayed as we send out hundreds of messages).
You will receive another reminder within a few days if your package also fails under R-devel (probably to be R-3.4.0 this spring).

All the best,
Uwe Ligges
(CRAN team)

The first row of scores is NA

I ran mRMR.ensemble with the exhaustive method and the first row of scores came back as NA. My label column is a binary factor ordered as "non-case" < "case".

Thank you for any feedback.

NaNs produced when extracting solutions

system.time(mrmr_4k <- mRMR.classic(data = mrdat_4k, target_indices = 105, feature_count = 10))
f4k <- solutions(mrmr_4k)$`105`
Warning message: In log(1 - object@mi_matrix[, target_index, drop = TRUE]^2) : NaNs produced

I have checked the objects mrmr_4k and mrdat_4k and I can't find any NaN, NA or Inf values.

Conversely, I've created a dummy dataset that does include NaN, Inf and NA values, but with it I can't reproduce this error message.

Do you know the possible reasons why this could occur?
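For what it's worth, the warning's own expression hints at where to look: log(1 - mi^2) is NaN whenever an entry of the mutual-information matrix has absolute value >= 1 (for instance a feature perfectly correlated with the target, possibly via floating-point round-off). A sketch for locating such entries, assuming mrmr_4k is the fitted mRMRe.Filter object from above:

```r
library(mRMRe)

# log(1 - mi^2) produces NaN for any |mi| >= 1, so scan the target's column of
# the mutual-information matrix for such entries (mrmr_4k is the object above;
# 105 is the target index used in the call):
mi <- mim(mrmr_4k)
suspects <- which(abs(mi[, 105]) >= 1)
print(suspects)
```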

A little question about mRMR and mrmr_selection

I selected features from the same data set with the mRMRe package in R and with mrmr_selection in Python, using the same n_features, but the results were very different. Is this because the two packages use different calculation methods?

how to control bootstrap subsample size?

Hello.
I'm using this efficient mRMR implementation for feature selection on feature matrices of dimension e.g. 50K x 5K (nSamples x nFeatures). When using method = "bootstrap", how can I control the bootstrap sample size, e.g. to take only a 10% subset of the original data during ensemble feature selection?

The PDF in the supplemental information describes a parameter m, introduced in the section "Algorithm 2", but I can't find a corresponding argument in the methods.

[screenshot of Algorithm 2 from the supplementary PDF]
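Since mRMR.ensemble does not appear to expose a subsample-size argument, one workaround is to draw the bootstrap subsamples by hand and run the filter once per draw. A sketch under that assumption (the wrapper name and arguments are made up; X is the raw data frame with the target in column 1):

```r
library(mRMRe)

# Draw n_draws bootstrap subsamples of frac * nrow(X) rows each and collect
# one mRMR solution per draw (hypothetical wrapper):
subsample_mrmr <- function(X, frac = 0.1, n_draws = 10, k = 10) {
  lapply(seq_len(n_draws), function(i) {
    rows <- sample(nrow(X), size = ceiling(frac * nrow(X)), replace = TRUE)
    dd   <- mRMR.data(data = X[rows, , drop = FALSE])
    solutions(mRMR.classic(data = dd, target_indices = 1, feature_count = k))[[1]]
  })
}
```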

Crashing with feature_count > 0

Whenever I call mRMR.classic or mRMR.ensemble with feature_count greater than 1, the R session crashes. In particular, I am testing an example with survival data.

Issue in detecting pandas datatypes when I call mRMR.data from Python

Hello,

I am currently using mRMRe for feature selection on a large dataset with over 3 million rows and about 3000 columns (after dummying). Currently I prepare the CSV in Python and read it into R to apply mRMRe.

However, I wanted to do everything in Python, and hence looked at using rpy2 for this.

Below is a reproducible example:

import pandas as pd

# Check for dataframes:
df_final = {'NYC': ['0','0','0','0','0','0','0','0','1'],
            'WT': ['0','0','0','0','0','0','0','0','0'],
            'Video': ['0','0','0','0','0','0','0','0','0'],
            'Video/OOL': ['0','0','0','0','0','0','0','0','0'],
            'Video/OOL/OV': ['1','1','0','1','1','1','1','1','0'],
            'Video/OV': ['0','0','0','0','0','0','0','0','0'],
            'OOL,Only': ['0','0','0','0','0','0','0','0','1'],
            'OOL/OV':	['0','0','0','0','0','0','0','0','0'],
            'Bulk':	['0','0','1','0','0','0','0','0','0'],
            'OV Only': ['0','0','0','0','0','0','0','0','0'],
            'class': ['0','0','0','0','1','0','0','1','0']}
df_final = pd.DataFrame.from_dict(df_final)
df_final.dtypes
for column in df_final.columns:
    try:
        df_final[column].astype('int64')  # test whether the column is convertible
    except (ValueError, TypeError):
        df_final.drop(column, axis=1, inplace=True)

# Now mRMR:
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri

utils = importr('utils') #-- Only once.
pymrmr = importr('mRMRe')


data = df_final
data = data.select_dtypes(exclude=['bool'])
data_new = data.astype('int64')
data_new.dtypes          
    
pandas2ri.activate()
    
r_df = pandas2ri.py2ri(data_new)
print (r_df)

Printing the data frame produces the desired result.

However, when I pass this into R's mRMR.data function:

dd = pymrmr.mRMR_data(data = (r_df))
Traceback (most recent call last):

  File "<ipython-input-53-85d2eb6cf3ea>", line 1, in <module>
    dd = pymrmr.mRMR_data(data = (r_df))

  File "C:\Users\shuvayan.das\AppData\Local\Continuum\anaconda3.1\lib\site-packages\rpy2-2.9.1-py3.6-win-amd64.egg\rpy2\robjects\functions.py", line 178, in __call__
    return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)

  File "C:\Users\shuvayan.das\AppData\Local\Continuum\anaconda3.1\lib\site-packages\rpy2-2.9.1-py3.6-win-amd64.egg\rpy2\robjects\functions.py", line 106, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)

RRuntimeError: Error in .local(.Object, ...) : 
  data columns must be either of numeric, ordered factor or Surv type

Since I have explicitly converted everything to integers, I do not understand why this error occurs.

One issue is that r_df is an rpy2.robjects.vectors.DataFrame, and I am not sure whether this is the appropriate structure for mRMR.data.

I am also not sure whether this issue comes from mRMRe or from rpy2.
I need to use this through Python only, as the code will run in production. Though there are other ways of achieving that, I would really like to understand the issue and fix it if possible.

Thanks,
Shuvayan Das
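One hedged guess, not confirmed by the maintainers: pandas int64 columns are converted by rpy2 into R integer vectors, while mRMR.data's check wants numeric (double), ordered factor, or Surv columns. Casting to float64 before the conversion would sidestep that distinction:

```python
import pandas as pd

# Cast every column to float64 so rpy2 produces R "numeric" (double) vectors
# instead of integer vectors (assumption: the integer-vs-numeric distinction
# is what trips mRMR.data's type check).
df = pd.DataFrame({'NYC': ['0', '0', '1'], 'class': ['0', '1', '0']})
df_numeric = df.astype('float64')
print(df_numeric.dtypes)
# afterwards: r_df = pandas2ri.py2ri(df_numeric); pymrmr.mRMR_data(data=r_df)
```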

How to find the optimal feature_count ?

Hi, I have RNA-seq data containing over 10K genes and I would like to find the optimal features to fit the classification model. I am wondering how to find the optimal feature count. Here is my code:
mrEnsemble <- mRMR.ensemble(data = Xdata, target_indices = c(1) ,feature_count = 100 ,solution_count = 1)
mrEnsemble_genes <- as.data.frame(apply(solutions(mrEnsemble)[[1]], 2, function(x, y) { return(y[x]) }, y=featureNames(Xdata)))
View(mrEnsemble_genes)

I just set feature_count = 100, but I am wondering how to find the optimal feature count for classification without fixing the number myself.
The result after extracting mrEnsemble_genes is a list of genes like:
gene05
gene08
gene45
gene67

Are they ranked by a score calculated from mutual information? I mean, does the first-ranked gene have the highest MI, making it a good gene for classifying the sample classes, i.e. cancer vs. normal? Thank you.
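mRMRe itself does not pick the feature count; a common approach is to treat feature_count as a hyperparameter and keep the value whose downstream classifier cross-validates best. A rough sketch, reusing the objects above (cv_accuracy is a placeholder scoring function you would supply):

```r
library(mRMRe)

# Sweep candidate feature counts and keep the one whose selected genes give
# the best cross-validated classifier (cv_accuracy is hypothetical; Xdata is
# the mRMR.data object from above, target in column 1):
candidate_k <- c(10, 25, 50, 100, 200)
cv_scores <- sapply(candidate_k, function(k) {
  fit   <- mRMR.ensemble(data = Xdata, target_indices = 1,
                         feature_count = k, solution_count = 1)
  genes <- featureNames(Xdata)[solutions(fit)[[1]][, 1]]
  cv_accuracy(genes)  # hypothetical: cross-validated accuracy with these genes
})
best_k <- candidate_k[which.max(cv_scores)]
```

Note that the solution is reported in selection order: the first feature maximizes relevance to the target, and later features trade relevance against redundancy with the already-selected ones, so the order is not a pure MI ranking.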

Vignette is outdated

The vignette says to install using biocLite, which is deprecated. Update it to BiocManager, the replacement package.
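A minimal replacement snippet (a sketch of the standard BiocManager workflow; note mRMRe can also be installed directly from CRAN with install.packages):

```r
# BiocManager replaces the deprecated biocLite() installer:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("mRMRe")
```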

drug_IRINOTECAN.RData

Hello,

I am trying to run your R script
prediction_drug.R
from the supplementary info of your paper
De Jay et al. 2013. mRMRe: an R package for parallelized mRMR ensemble feature selection

It fails at the line
load(file.path("drug_IRINOTECAN.RData"))
reporting that the file doesn't exist.

Looking at your supplementary info (suppl_info_revised_v2_light.pdf), it says the file can be downloaded from
https://dl.dropboxusercontent.com/u/21216478/drug_IRINOTECAN.RData
However, that URL reports
Error (404) - We can't find the page you're looking for.

Could you update your supplementary info so I can get the missing file?

regards
Desmond

Forcing a subset of features to be part of the solution

Many applications include prior knowledge of a set of features that must be part of the solution. Is there a way to modify mRMR to allow for that?

A possible way to do it: compute the pairwise correlation matrix as usual, and also a second correlation matrix between the subset of features of interest and the remaining features. Then penalize each feature in the original matrix by subtracting its average correlation (taken from the second matrix) with the subset of features of interest.
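The proposed penalization can be sketched in base R with made-up data (the names and the toy relevance measure below are illustrative, not part of mRMRe):

```r
# Toy data: 6 features, with f1 and f2 forced into the solution.
set.seed(1)
X <- matrix(rnorm(100 * 6), ncol = 6,
            dimnames = list(NULL, paste0("f", 1:6)))
y <- X[, 1] + rnorm(100)  # toy target

must_keep  <- c("f1", "f2")
candidates <- setdiff(colnames(X), must_keep)

relevance <- abs(cor(X[, candidates], y))[, 1]             # relevance to target
penalty   <- colMeans(abs(cor(X))[must_keep, candidates])  # mean |cor| with forced set
score     <- relevance - penalty                           # penalized ranking
sort(score, decreasing = TRUE)
```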

Conflict for some functions between Biobase and mRMRe

Sometimes the featureNames() function is not recognized. It looks like it depends on the order in which the libraries are loaded.

If one loads genefu and then mRMRe, one gets the following message:

Attaching package: ‘mRMRe’

The following objects are masked from 'package:Biobase':

    featureData, featureNames, sampleNames

So I guess there is some kind of conflict with Biobase; any idea how to fix this?
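mRMRe masks those generics simply because it was attached after Biobase; the usual fix is to call them with an explicit namespace so the load order stops mattering (a sketch, assuming both packages are installed; eset and mrmr_data are placeholder objects):

```r
library(Biobase)
library(mRMRe)  # attached last, so its featureNames() masks Biobase's

fn_bio  <- Biobase::featureNames(eset)      # always the Biobase generic
fn_mrmr <- mRMRe::featureNames(mrmr_data)   # always the mRMRe generic
```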

Garbage collection issue!

I ran mRMR.ensemble in parallel on a number of cores of a machine for several rounds. It works fine for the first round, but after that the memory usage goes up unexpectedly.

Trouble with my categorical target variable

Hello,

I have a data frame "data" with 60 rows (samples) and 20228 columns, where the first column is my target variable (an ordered factor: 0 or 1) and the other columns are my numeric features. I want to do feature selection with mRMRe inside a loop implementing 5-fold cross-validation repeated 3 times, selecting 25 features each time. Here is the problematic part of my code:

library(caret)
library(mRMRe)

data <- read.csv("home/RNA_seq.csv", row.names=1, sep=";", stringsAsFactors=FALSE)
data <- data.frame(t(data))
data[,1] <- factor(data[,1])
data[,1] <- ordered(data[,1], levels = c("0", "1"))

features_select <- list()

r <- 5 # 5-fold cross-validation
t <- 3 # repeated 3 times
  for (j in 1:t){
    for (i in 1:r){
      # 5-fold cross-validation (note: folds are regenerated on every iteration)
      train.index <- createFolds(factor(data$Response), k = 5, list = TRUE, returnTrain = TRUE) 
      datatrain <- data[train.index[[i]],]
      datatest  <- data[-train.index[[i]],]
      
      #Feature selection
      data.mrmre.train <- mRMR.data(data=datatrain)
      res.fs.mrmr <- mRMR.classic(data=data.mrmre.train, target_indices=1, feature_count=25)
      selected.features.mrmre <- mRMRe::solutions(res.fs.mrmr)
      features_select[[((j-1)*r+i)]] <- res.fs.mrmr@feature_names[unlist(selected.features.mrmre)]
      print(features_select[[((j-1)*r+i)]])
      print(res.fs.mrmr)
    }
  }

My problem is that sometimes my target variable "Response" (column 1 of "data") is selected by mRMRe as a feature. When this happens, "Response" fills every slot from feature 2 up to the requested count (here 25). For example:

features_select :

[[1]]
[1] "AC137800.2" "AC007387.1" "AC079354.1" "AC145138.1" "RNA5SP370" 
[6] "RNA5SP219"  "AL022324.1" "AC023449.1" "AP000873.1" "AC020612.2"
[11] "RNA5SP473"  "AC092810.1" "IGKV1D.37"  "SST"        "AC093331.1"
[16] "TRAJ34"     "AC107983.1" "RPL39P"     "HSBP1P1"    "TRBJ1.6"   
[21] "PHGR1"      "RNA5SP435"  "RNA5SP301"  "AC005255.1" "KRT127P"

[[2]]
 [1] "AC073869.8"   "Response" "Response" "Response" "Response" "Response"
 [7] "Response" "Response" "Response" "Response" "Response" "Response"
[13] "Response" "Response" "Response" "Response" "Response" "Response"
[19] "Response" "Response" "Response" "Response" "Response" "Response"
[25] "Response"

Here is the output of mRMR.classic() corresponding to the two feature sets above.

[[1]]
Formal class 'mRMRe.Filter' [package "mRMRe"] with 8 slots
  ..@ filters       :List of 1
  .. ..$ 1: int [1:25, 1] 18837 18781 15503 15526 17437 20028 18924 17133 17024 16104 ...
  ..@ scores        :List of 1
  .. ..$ 1: num [1:25, 1] 0.817 0.819 0.817 0.817 0.817 ...
  ..@ mi_matrix     : num [1:20228, 1:20228] NA -0.3786 -0.1536 -0.0929 -0.0964 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  .. .. ..$ : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  ..@ causality_list:List of 1
  .. ..$ 1: num [1:20228] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
  ..@ sample_names  : chr [1:48] "Pt1_28" "Pt2_28" "Pt4_28" "Pt5_28" ...
  ..@ feature_names : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  ..@ target_indices: int 1
  ..@ levels        : int [1:25] 1 1 1 1 1 1 1 1 1 1 ...

[[2]]
Formal class 'mRMRe.Filter' [package "mRMRe"] with 8 slots
  ..@ filters       :List of 1
  .. ..$ 1: int [1:25, 1] 1 1 1 1 1 1 1 1 1 1 ...
  ..@ scores        :List of 1
  .. ..$ 1: num [1:25, 1] 0 0 0 0 0 0 0 0 0 0 ...
  ..@ mi_matrix     : num [1:20228, 1:20228] NA -0.518 -0.246 -0.211 -0.204 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  .. .. ..$ : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  ..@ causality_list:List of 1
  .. ..$ 1: num [1:20228] NA NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
  ..@ sample_names  : chr [1:48] "Pt1_28" "Pt2_28" "Pt4_28" "Pt5_28" ...
  ..@ feature_names : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  ..@ target_indices: int 1
  ..@ levels        : int [1:25] 1 1 1 1 1 1 1 1 1 1 ...

This doesn't always happen for the same values of i and j in the loop. Do you have an idea where the problem is?

Thank you in advance !

mRMR.classic type of target column for binary classification

Hello:

I have a problem using mRMR.classic on my own dataset. I would like to use mRMR feature selection before doing binary classification, so my target column contains '0' or '1' (class labels). Does it matter whether I set the type of the target column to 'ordered factor' or 'numeric'? I tried both, and I got different sets of selected features. Which type should I use?

Thanks a lot for your help!

Here is the code

set the type of target column to be 'ordered factor':

df <- data.frame(target = factor(ylabel, ordered = TRUE), data.matrix(x))
sapply(df, class)
f_data <- mRMR.data(data = data.frame(df))
featureData(f_data)
mRMR.classic(data = f_data, target_indices = 1, feature_count = 30)

set the type of target column to be 'numeric':

df <- data.frame(target = ylabel, data.matrix(x))
sapply(df, class)
df <- transform(df, target = as.numeric(target))
sapply(df, class)
f_data1 <- mRMR.data(data = data.frame(df))
featureData(f_data1)
mRMR.classic(data = f_data1, target_indices = 1, feature_count = 30)
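Differing results are expected to some degree: mRMRe chooses its association estimator from the column types (for instance, the concordance-index-based estimate discussed elsewhere in these issues applies to ordered columns), so the two encodings genuinely measure association differently. A side-by-side sketch of the two target constructions (ylabel is the user's label vector):

```r
# Ordered-factor target: mRMRe treats the column as ordinal.
target_ord <- factor(ylabel, ordered = TRUE)

# Numeric target: mRMRe treats the column as continuous.
target_num <- as.numeric(ylabel)

stopifnot(is.ordered(target_ord), is.numeric(target_num))
```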

Redundancy of functions

There are two functions described in the reference manual that seem to do exactly the same thing: get_thread_count and get.thread.count.

Discrepancy in scoring

I have been assessing your package and comparing it with my own manual calculations. I have hit a point where I cannot follow the logic of the scoring conventions in this package. Specifically, this piece of code is a head-scratcher:
setMethod("scores", signature("mRMRe.Filter"), function(object) {
  mi_matrix <- mim(object)
  targets <- as.character(target(object))
  scores <- lapply(targets, function(target) {
    apply(solutions(object)[[target]], 2, function(solution) {
      sapply(1:length(solution), function(i) {
        feature_i <- solution[i]  # feature location
        if (i == 1)
          return(mi_matrix[as.numeric(target), feature_i])  # initialize
        ancestry_score <- mean(sapply((i - 1):1, function(j) mi_matrix[feature_i, solution[j]]))
        return(mi_matrix[as.numeric(target), feature_i] - ancestry_score)
      })
    })
  })
  names(scores) <- targets
  return(scores)
})

When I use the continuous_estimator argument, I see that the mi_matrix returned comes from the Spearman correlation matrix, but the scores do not match those values; instead the scores correspond to a MUTUAL INFORMATION matrix rather than the continuous correlation matrix. I.e.:

ensembl <- mRMR.classic(data = dd1, target_indices = c(1), feature_count = 10, continuous_estimator = 'spearman')
scores(ensembl) # returns scores related to the mutual information matrix, not the correlation-matrix scores

Is there a way, other than directly pulling your GitHub code, to get the scores to match the correlation matrix?

cannot install package: package does not exist in CRAN

I am having difficulty installing your package. It has been removed from CRAN:

package ‘mRMRe’ is not available (for R version 3.4.3)

I tried installing it using the "remotes" package, but again to no avail.

The downloaded binary packages are in
C:\Users\aahmadi\AppData\Local\Temp\RtmpgXztq4\downloaded_packages

library(remotes)
Warning message:
package ‘remotes’ was built under R version 3.4.4
install_github("cran/mRMRe")
Downloading GitHub repo cran/mRMRe@master
These packages have more recent versions available.
Which would you like to update?

1: pkgconfig (2.0.1 -> 2.0.2) [CRAN]

Enter one or more numbers separated by spaces, or an empty line to cancel
1: 1
igraph (NA -> 1.2.2) [CRAN]
pkgconfig (2.0.1 -> 2.0.2) [CRAN]
Installing 2 packages: igraph, pkgconfig
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/igraph_1.2.2.zip'
Content type 'application/zip' length 8316376 bytes (7.9 MB)
downloaded 7.9 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/pkgconfig_2.0.2.zip'
Content type 'application/zip' length 19967 bytes (19 KB)
downloaded 19 KB

package ‘igraph’ successfully unpacked and MD5 sums checked
package ‘pkgconfig’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\aahmadi\AppData\Local\Temp\RtmpgXztq4\downloaded_packages
Running R CMD build...

  • checking for file 'C:\Users\aahmadi\AppData\Local\Temp\RtmpgXztq4\remotes14f0569f70fa\cran-mRMRe-9f13944/DESCRIPTION' ... OK
  • preparing 'mRMRe':
  • checking DESCRIPTION meta-information ... OK
  • cleaning src
  • checking vignette meta-information ... OK
  • checking for LF line-endings in source and make files and shell scripts
  • checking for empty or unneeded directories
  • looking to see if a 'data/datalist' file should be added
  • building 'mRMRe_2.0.7.tar.gz'
  • installing source package 'mRMRe' ...
    ** libs

*** arch - i386
Warning: running command 'make -f "Makevars" -f "C:/PROGRA1/R/R-341.3/etc/i386/Makeconf" -f "C:/PROGRA1/R/R-341.3/share/make/winshlib.mk" SHLIB_LDFLAGS='$(SHLIB_CXXLDFLAGS)' SHLIB_LD='$(SHLIB_CXXLD)' SHLIB="mRMRe.dll" OBJECTS="Data.o Filter.o Math.o Matrix.o MutualInformationMatrix.o exports.o"' had status 127
ERROR: compilation failed for package 'mRMRe'

  • removing 'C:/Program Files/R/R-3.4.3/library/mRMRe'
    In R CMD INSTALL
    Error: (converted from warning) running command '"C:/PROGRA1/R/R-341.3/bin/x64/R" CMD INSTALL -l "C:\Program Files\R\R-3.4.3\library" "C:/Users/aahmadi/AppData/Local/Temp/RtmpgXztq4/file14f042e12d98/mRMRe_2.0.7.tar.gz"' had status 1
    In addition: Warning messages:
    1: In untar2(tarfile, files, list, exdir) :
    skipping pax global extended headers
    2: In untar2(tarfile, files, list, exdir) :
    skipping pax global extended headers
    3: In missing_devel_warning(pkgdir) :
    Package mRMRe has compiled code, but no suitable compiler(s) were found. Installation will likely fail.
    Install Rtools (https://cran.r-project.org/bin/windows/Rtools/). Then use the pkgbuild package, or make sure that Rtools is in the PATH.

library(mRMRe)
Error in library(mRMRe) : there is no package called ‘mRMRe’
install.packages("mRMRe")
Warning in install.packages :
package ‘mRMRe’ is not available (for R version 3.4.3)

I found the most recent version, unzipped it, and copied the folder into
C:\Program Files\R\R-3.4.3\library\mRMRe
but I got the message that it was not installed properly.

The return variables contain the response variable

I am a user of your package and recently found that when I increase the ensemble number (e.g. solution_count), some of the selected variables include the response variable, which is undesirable. Is there a way to fix this?
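Until that is addressed upstream, a workaround sketch is to strip the target's own index from the returned solutions (hypothetical objects; the target here is column 1):

```r
library(mRMRe)

# solutions() returns a feature_count x solution_count index matrix per
# target; drop any occurrence of the target's own column index (1 here):
sols  <- solutions(fit)[[1]]  # fit: an mRMRe.Filter object
clean <- apply(sols, 2, function(s) s[s != 1])
```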

Falsely symmetrical kendall / c-index mutual information matrix

In this extreme example (made to simulate zero-inflated count data), where one variable has a lot more zeros, the Kendall estimate (implemented as (c_index - 0.5) * 2 in mRMRe) differs between (X, Y) and (Y, X). In other words, it is an asymmetric estimator.

However, mim() returns a symmetric matrix, whereas the two calls to correlate() give very different results.

library(mRMRe)
set.seed(0)
dataset = data.frame(X = as.numeric(rbinom(10000, 10, 0.1)),
                     Y = as.numeric(rbinom(10000, 10, 0.5)))

mRMRe_dataset = mRMR.data(dataset)

mim(mRMRe_dataset, method = "cor", continuous_estimator = "kendall")

correlate(dataset$Y, dataset$X, method = "kendall")
correlate(dataset$X, dataset$Y, method = "kendall")
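To make the asymmetry explicit, both directions can be computed with correlate() and averaged into a symmetric estimate by hand (a sketch; whether mim() should do this internally is exactly the question raised above):

```r
library(mRMRe)

cxy <- correlate(dataset$X, dataset$Y, method = "kendall")
cyx <- correlate(dataset$Y, dataset$X, method = "kendall")

# Explicitly symmetric estimate, averaging the two directions:
sym <- 0.5 * (cxy + cyx)
```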

Register native routines

The Date field is over a month old. [BHK: can you try --resave-data or something like this when you build?]

  • checking top-level files ... NOTE
    Non-standard file/directory found at top level: [BHK: weird; worst case, we can remove it at the end in the tar.gz]
    'unitest'

  • checking compiled code ... NOTE
    File 'mRMRe/libs/i386/mRMRe.dll':
    Found no calls to: 'R_registerRoutines', 'R_useDynamicSymbols'
    File 'mRMRe/libs/x64/mRMRe.dll':
    Found no calls to: 'R_registerRoutines', 'R_useDynamicSymbols'

It is good practice to register native routines and to disable symbol
search.

See 'Writing portable packages' in the 'Writing R Extensions' manual. [BHK: I am not sure about that one]

questions on mRMRe

I am trying this code on a rather small dataset I created myself. The data consist of 10 covariates and 100 observations. The first three are correlated among themselves, and the first 5 are associated with the outcome. The outcome is a survival object created by Surv, and it sits in the first column of the data.
I am trying to use the package and I have the following questions.

  1. I am mostly interested in mRMR.ensemble with method = 'bootstrap'. My understanding is that the solution_count parameter implicitly sets the number of bootstrap resamples. Since the survival object is in the first column of the data, target_indices = 1. At this point I am only asking for feature_count = 3. The question is why this code does not work for a solution count of 125: it works for 100 but not for 125.

b=mRMR.ensemble(data = mydd, target_indices = c(1),method='bootstrap',solution_count=100,feature_count = 3)
b=mRMR.ensemble(data = mydd, target_indices = c(1),method='bootstrap',solution_count=125,feature_count = 3)
Error in .local(.Object, ...) :
user cannot request for more solutions than is possible given the data set

  2. It seems that once it chooses one solution it gets stuck with it. If I ask for 10 solutions and run it twice I get:

b=mRMR.ensemble(data = mydd, target_indices = c(1),method='bootstrap',solution_count=10,feature_count = 3)
solutions(b)
$1
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 3 3 3 3 3 3 3 3 3 3
[2,] 5 5 5 5 5 5 5 5 5 5
[3,] 6 7 7 7 7 7 7 7 7 7
b=mRMR.ensemble(data = mydd, target_indices = c(1),method='bootstrap',solution_count=10,feature_count = 3)
solutions(b)
$1
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 3 2 2 2 2 2 2 2 2 2
[2,] 5 5 5 5 5 5 5 5 5 5
[3,] 6 9 9 9 9 9 9 9 9 9
But if I run it asking for 20 solutions I get:
b=mRMR.ensemble(data = mydd, target_indices = c(1),method='bootstrap',solution_count=20,feature_count = 3)
solutions(b)
$1
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
[1,] 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[2,] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
[3,] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
I would have expected to get a mixture of the two in all three instances.

  3. Of lesser importance is the following. When I run mRMR.classic asking for more features than possible, the outcome appears among the chosen features and is repeated until the requested number is filled. Here I have 10 covariates and I am asking for 13. This is what I get:

a=mRMR.classic(data = mydd, target_indices = c(1),feature_count = 13)
solutions(a)
$1
[,1]
[1,] 3
[2,] 5
[3,] 6
[4,] 2
[5,] 8
[6,] 7
[7,] 4
[8,] 11
[9,] 10
[10,] 9
[11,] 1
[12,] 1
[13,] 1
This is not a problem in itself, but it becomes one when running simulations where the process is automatic. I am wondering if this can be fixed.
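For automated simulations, a defensive wrapper can cap the request so the filter is never asked for more features than exist besides the target (a sketch with hypothetical names):

```r
library(mRMRe)

# Never request more features than the data can supply (dd: an mRMR.data
# object; the target itself is excluded from the available features):
safe_classic <- function(dd, target, k) {
  k_max <- length(featureNames(dd)) - 1
  mRMR.classic(data = dd, target_indices = target,
               feature_count = min(k, k_max))
}
```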

thank you

13 February 2018
Thank you for looking into this issue. I have tried it, and the selection now differs from run to run.

  1. However, solution_count still has a limit on how many solutions I can ask for:
    b=mRMR.ensemble(data = mydd, target_indices = c(1),solution_count=300,feature_count = 7)
    Error in .local(.Object, ...) :
    user cannot request for more solutions than is possible given the data set

  2. Also, when solution_count is above a certain number, the first solutions contain only 1, which in my case is the outcome.

b=mRMR.ensemble(data = mydd, target_indices = c(1),solution_count=30,feature_count = 7)
solutions(b)
$1
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
[1,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 9 10
[2,] 1 1 1 1 1 1 1 1 1 1 2 3 4 5 6 7 8 9 10 11 3 3
[3,] 1 1 1 1 1 1 1 1 1 1 5 6 5 3 2 3 3 3 3 3 5 5
[4,] 1 1 1 1 1 1 1 1 1 1 8 8 6 6 5 5 5 5 5 5 2 2
[5,] 1 1 1 1 1 1 1 1 1 1 6 2 2 4 3 2 2 2 2 2 6 6
[6,] 1 1 1 1 1 1 1 1 1 1 7 11 8 8 9 6 6 8 6 6 8 8
[7,] 1 1 1 1 1 1 1 1 1 1 4 10 10 2 10 10 7 6 8 8 4 11
[,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30]
[1,] 8 11 7 6 5 4 2 3
[2,] 3 3 3 2 6 5 5 5
[3,] 5 5 5 11 8 6 6 6
[4,] 2 2 2 3 3 3 8 2
[5,] 7 7 6 5 7 8 4 8
[6,] 10 6 10 8 10 7 3 7
[7,] 4 8 4 4 2 10 10 4

Thank you again. Sorry for being so insistent, but I think this function has a lot of potential to help me.
Best regards.
Melania

Sensitivity to the number of features!?

It seems that mRMR.ensemble has some issue with the indices. I have an input file with 19k features; when I ask mRMR.ensemble for a solution with 100 features it works fine, but when I subset the input matrix to 1000 features and ask for a solution with 100 features, it crashes.

Negative weights for Importance

Hello,

Can you please tell me what negative values in the importance column mean when computing feature weights? I am running a classification problem.
Do they mean the variable is not important for the model, as in random forest?
Or is it about correlation with the class variable: the dependent variable being an ordered factor (A, B, C, D), does the largest negative value of a variable mean more importance for class D, and the largest positive value more importance for class A?

When using:
selected_features <- unique(as.vector(solutions(feature_new)[[1]]))
variableImportance <- data.frame(importance = feature_new@mi_matrix[selected_features,1])
variableImportance$feature <- rownames(variableImportance)
row.names(variableImportance) <- NULL
variableImportance <- na.omit(variableImportance)
variableImportance <- variableImportance[order(variableImportance$importance, decreasing=TRUE),]
variableImportance

Results

importance | feature
0.300238634 | xxx45
0.299554566 | xxx23
0.299273488 | xxx45
-0.300668151 | xxx23
-0.33518931 | xxx22
-0.462138085 | xxx11
-0.496659243 | xxx1
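If the sign only reflects the direction of association (a plausible reading, given that the mi_matrix entries shown elsewhere in these issues are signed), then for a pure importance ranking one can order by absolute value (a sketch reusing the variableImportance data frame from above):

```r
# Rank by magnitude, keeping the sign visible as the direction of association:
variableImportance <- variableImportance[
  order(abs(variableImportance$importance), decreasing = TRUE), ]
variableImportance
```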

Selecting feature subset from a feature set using mRMRe R package

Hi,
As per suggestion in the email reply of Dr. Benjamin Haibe-Kains, I am creating an issue regarding my query. Please excuse if the question is simple as I am new in R. Below is the detail.

Suppose I have a CSV file gene.csv (attached as gene.zip) with a feature set of 6 attributes ([G1.1.1.1], [G1.1.1.2], [G1.1.1.3], [G1.1.1.4], [G1.1.1.5], [G1.1.1.6]) and a target class variable [Output] ('1' indicates the positive class and '-1' the negative class). Here is a sample of gene.csv (see attached zip file):

[G1.1.1.1]  [G1.1.1.2]  [G1.1.1.3]  [G1.1.1.4]  [G1.1.1.5]  [G1.1.1.6]  [Output]
11.688312   0.974026    4.87013     7.142857    3.571429    10.064935   -1
12.538226   1.223242    3.669725    6.116208    3.363914    9.174312     1
10.791367   0.719424    6.115108    6.47482     3.597122    10.791367   -1
13.533835   0.37594     6.766917    7.142857    2.631579    10.902256    1
9.737828    2.247191    5.992509    5.992509    2.996255    8.614232    -1
11.864407   0.564972    7.344633    4.519774    3.389831    7.909605    -1
11.931818   0           7.386364    5.113636    3.409091    6.818182     1
16.666667   0.333333    7.333333    4.333333    2           8.333333    -1

I am trying to get the best feature subset of 2 attributes (out of the above 6) and wrote the following R code:

library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
f_data <- mRMR.data(data = data.frame(df))
featureData(f_data)
mRMR.ensemble(data = f_data, target_indices = 7, 
              feature_count = 2, solution_count = 1)

When I run this code, I get the following error at the statement f_data <- mRMR.data(data = data.frame(df)):

Error in .local(.Object, ...) : 
  data columns must be either of numeric, ordered factor or Surv type

However, the data in each column of the CSV file are real numbers. So how can I change the R code to fix this problem? Also, I am not sure what the value of target_indices should be in the statement mRMR.ensemble(data = f_data, target_indices = 7, feature_count = 2, solution_count = 1), as my target class variable is named "[Output]" in gene.csv.

I will appreciate much if you kindly help me to obtain the best feature subset based on the gene.csv file using your mRMRe R package.

Thank you very much.

Sincerely,
Abu
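A hedged guess at the cause: read.csv imports whole-number columns such as [Output] as integer, and mRMR.data accepts only numeric, ordered factor, or Surv columns. Coercing every column to double before building the mRMR.data object may resolve it (a sketch reusing the variables from the question):

```r
library(mRMRe)

df <- read.csv(file_n, header = TRUE)
df[] <- lapply(df, as.numeric)  # integer columns become double

f_data <- mRMR.data(data = df)

# target_indices is the *column position* of the target; [Output] is the 7th
# column of gene.csv, hence target_indices = 7.
mRMR.ensemble(data = f_data, target_indices = 7,
              feature_count = 2, solution_count = 1)
```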

dependent(ordered factor) variable is coming as a feature when doing mRMR.ensemble

Hello,

I have a classification problem; my independent variables are a mix of ordered factors and numeric fields. My dependent variable is an ordered factor in binary format (yes/no) [1/2].

I have tried the following examples:

  • The Titanic data frame, following https://amunategui.github.io/variable-importance-shuffler/, where all the variables are numeric: this works fine.
  • The CGPS example, following https://cran.r-project.org/web/packages/mRMRe/vignettes/mRMRe.pdf, where the dependent variable cgps.ic50 comes up among the other selected features: is that the expected result?

My problem is similar to the CGPS dataset: in the final feature selection (variableImportance) the dependent variable also comes up. The dependent variable is at index 1 (first line of code below) and the feature count is 30, including the dependent variable. To counter this I added the line commented "removing the dependent variable" below. My question is: is that the right approach?

Then, if I append the dependent variable as the last column of the dataset and pass ncol(dataset) as target_indices, running the code below returns all the variables of the dataset (97 columns, minus the dependent variable); the feature count does not stop at 30. So I then filter the features by importance (the last filtering line). If this is the right approach, can you please suggest a threshold to filter the features with, like 0.5?

ds_fe <- data.frame( CLASS=ds_fe$CLASS, ds_fe[-(1:7)])
data <- mRMR.data(data = ds_fe)
feature_new <- mRMR.ensemble(data = data, target_indices = c(1), feature_count = 30, solution_count = 5)
variableImportance <-data.frame('importance'= feature_new@mi_matrix[nrow(feature_new@mi_matrix),])
variableImportance$feature <- rownames(variableImportance)
row.names(variableImportance) <- NULL
variableImportance <- na.omit(variableImportance)
variableImportance <- variableImportance[variableImportance$feature != "CLASS",] ### removing the dependent variable
variableImportance <- variableImportance[variableImportance$importance > 0.1 | variableImportance$importance < -0.1, ]
print(variableImportance)
