I am currently using mRMRe for feature selection on a large dataset with over 3 million rows and about 3000 columns (after dummying).
Currently I prepare the CSV in Python and read it into R to apply mRMRe.
However, I want to do the whole thing in Python, so I looked at using rpy2 for this. Here is a reduced example:
import pandas as pd
# Check for dataframes:
df_final = {'NYC': ['0','0','0','0','0','0','0','0','1'],
            'WT': ['0','0','0','0','0','0','0','0','0'],
            'Video': ['0','0','0','0','0','0','0','0','0'],
            'Video/OOL': ['0','0','0','0','0','0','0','0','0'],
            'Video/OOL/OV': ['1','1','0','1','1','1','1','1','0'],
            'Video/OV': ['0','0','0','0','0','0','0','0','0'],
            'OOL,Only': ['0','0','0','0','0','0','0','0','1'],
            'OOL/OV': ['0','0','0','0','0','0','0','0','0'],
            'Bulk': ['0','0','1','0','0','0','0','0','0'],
            'OV Only': ['0','0','0','0','0','0','0','0','0'],
            'class': ['0','0','0','0','1','0','0','1','0']}
df_final = pd.DataFrame.from_dict(df_final)
df_final.dtypes
# Drop any column whose values cannot be cast to integers.
for column in df_final.columns:
    try:
        df_final[column].astype('int64')
    except (ValueError, TypeError):
        df_final.drop(column, axis=1, inplace=True)
# Now mRMR:
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
utils = importr('utils') #-- Only once.
pymrmr = importr('mRMRe')
data = df_final
data = data.select_dtypes(exclude=['bool'])
data_new = data.astype('int64')
data_new.dtypes
pandas2ri.activate()
r_df = pandas2ri.py2ri(data_new)
print(r_df)
dd = pymrmr.mRMR_data(data=r_df)
The last call fails with the following error:
Traceback (most recent call last):
  File "<ipython-input-53-85d2eb6cf3ea>", line 1, in <module>
    dd = pymrmr.mRMR_data(data = (r_df))
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\anaconda3.1\lib\site-packages\rpy2-2.9.1-py3.6-win-amd64.egg\rpy2\robjects\functions.py", line 178, in __call__
    return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\anaconda3.1\lib\site-packages\rpy2-2.9.1-py3.6-win-amd64.egg\rpy2\robjects\functions.py", line 106, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
RRuntimeError: Error in .local(.Object, ...) :
  data columns must be either of numeric, ordered factor or Surv type
Since I have explicitly converted every column to integers, I do not understand why this error occurs.
I am not sure whether a data frame like this is the appropriate input structure for mRMR.data.
I am also not sure whether the problem comes from mRMRe or from rpy2.
I need to run this from Python because the code will be used in production. There are other ways of achieving that, but I would really like to understand what the issue is and fix it if possible.
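One lead I have not confirmed, stated here as an assumption rather than a diagnosis: R treats integer vectors and "numeric" (double) vectors as different classes. If mRMRe's check only accepts the "numeric" class, then int64 columns passed through rpy2 could be rejected even though they hold numbers. Under that assumption, casting to float64 on the pandas side before calling pandas2ri would be the thing to try. The sketch below shows only the pandas part; `raw` and `to_float_frame` are hypothetical names used for illustration, and the frame is a small stand-in for df_final.

```python
import pandas as pd

# Small stand-in for df_final: string-encoded 0/1 columns, as in the example above.
raw = pd.DataFrame({
    "NYC": ["0", "0", "1"],
    "Bulk": ["0", "1", "0"],
    "class": ["0", "1", "0"],
})


def to_float_frame(frame):
    """Return a copy of `frame` with every column cast to float64.

    Assumption: float64 columns reach R as double ("numeric") vectors,
    whereas int64 columns reach R as integer vectors.
    """
    return frame.astype("float64")


data_float = to_float_frame(raw)
print(data_float.dtypes)
```

If the assumption holds, passing `data_float` instead of the int64 frame to the rpy2 conversion step would be the only change needed in the example above; if the error persists, the cause lies elsewhere.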