lacava / few

a feature engineering wrapper for sklearn

Home Page: https://lacava.github.io/few

License: GNU General Public License v3.0

Python 92.36% Shell 2.00% C++ 5.42% Makefile 0.22%

few's Issues

adding new node types

Greetings!

How can we add custom node types? Is it (currently) possible?

If original features are found by FEW, transform() method fails with TypeError

eg:

print('Model: {}'.format(learner.print_model()))

Model: original features

Phi = learner.transform(X_test.values)

TypeError Traceback (most recent call last)
in ()
----> 1 Phi = learner.transform(X_test.values)

~/anaconda3/envs/ml/lib/python3.6/site-packages/FEW-0.0.38-py3.6-macosx-10.7-x86_64.egg/few/few.py in transform(self, x, inds, labels)
395 # return np.asarray(Parallel(n_jobs=10)(delayed(self.out)(I,x,labels,self.otype) for I in self._best_inds)).transpose()
396 return np.asarray(
--> 397 [self.out(I,x,labels,self.otype) for I in self._best_inds]).transpose()
398
399

TypeError: 'NoneType' object is not iterable
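
One possible guard is sketched below. It assumes, based only on the traceback above, that `_best_inds` is left as `None` when the model consists of the original features. `SketchLearner` and its simplified `out()` are hypothetical stand-ins, not FEW's actual code.

```python
import numpy as np


class SketchLearner:
    """Hypothetical stand-in for the FEW estimator (not the library's code)."""

    def __init__(self, best_inds=None):
        # Assumption: _best_inds stays None when the original features win.
        self._best_inds = best_inds

    def out(self, ind, x):
        # Placeholder for evaluating one engineered feature on x.
        return ind(x)

    def transform(self, x):
        if self._best_inds is None:
            # No engineered features: pass the original features through.
            return np.asarray(x)
        return np.asarray(
            [self.out(ind, x) for ind in self._best_inds]).transpose()
```

With this guard, `transform()` returns the input unchanged instead of iterating over `None`.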

feature importance determines who survives

implement truncation selection based on feature scores. if no feature scores are available, use estimated fitness.
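
A minimal sketch of what this could look like, assuming higher importance is better and lower fitness is better. The function name `truncation_survival` and its arguments are hypothetical and not part of FEW.

```python
import numpy as np


def truncation_survival(population, fitness, n_survivors, importances=None):
    """Keep the n_survivors best individuals.

    Ranks by feature importance (higher is better) when it is given,
    otherwise falls back to fitness (assumed here: lower is better).
    """
    if importances is not None:
        order = np.argsort(-np.asarray(importances, dtype=float), kind="stable")
    else:
        order = np.argsort(np.asarray(fitness, dtype=float), kind="stable")
    return [population[i] for i in order[:n_survivors]]
```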

write DistanceClassifier()

write DistanceClassifier as separate sklearn learner

Error with installation

Hello,

I'm trying out this package but running into some issues. Any ideas? I have VS 14.16 on my PC and get this error when running 'pip install few'. At first it asked for eigency, but after installing that, this error popped up.

Sincerely,
G

installing few with pip

it cannot be installed with pip because setup.py imports eigency before the requirements have been installed!

add balanced accuracy option

add option to assess ML model based on balanced accuracy.
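
For reference, balanced accuracy is the mean of the per-class recalls. The sketch below computes it with numpy only; recent scikit-learn versions also provide `sklearn.metrics.balanced_accuracy_score`. How the option would be wired into FEW is not shown here.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))
```

On imbalanced labels this differs from plain accuracy: predicting the majority class for every sample of a 4:1 dataset scores 0.8 accuracy but only 0.5 balanced accuracy.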

feature importance determines parent pressure for selection

add this configuration option.

low GPU utilization with tf option

I'm getting low GPU utilization with the tensorflow evaluation strategy. There are a few things to try:

use this method to profile tensorflow and see where the inefficiencies lie.
according to this, using feed_dict is not a good idea. we need to look into using pipelines or variables for feeding input data to the graphs.

use check_random_state() instead of numpy.seed

Currently, few does not accept an instance of numpy.random as random state.
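
A sketch of the usual scikit-learn pattern, assuming the seeding happens in `fit`. `SeededSketch` is a hypothetical stand-in that shows only the random-state handling.

```python
import numpy as np
from sklearn.utils import check_random_state


class SeededSketch:
    """Hypothetical stand-in showing only the seeding logic."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self):
        # Accepts None, an int seed, or a numpy RandomState instance.
        self.rng_ = check_random_state(self.random_state)
        # Draw from the local generator instead of the global numpy state.
        return self.rng_.randint(0, 1000, size=5)
```

Drawing every random number from `self.rng_` rather than from the global `numpy.random` functions keeps runs reproducible for a given seed.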

Issues with current ML validation score

Hello,

Thanks for the help so far. I was able to get the tool up and running in windows.

However, I am observing two weird things.

When I use Gradient Boost Regressor, my score gets worse with each generation, even when I switch the sign of the scoring function. The first score is nearly the best score I have gotten on my own (no feature engineering done on the data set).

https://github.com/GinoWoz1/AdvancedHousePrices/blob/master/FEW_GB.ipynb

When I use Random Forest - same scorer - current ML validation score returns as 0 and runs really fast

https://github.com/GinoWoz1/AdvancedHousePrices/blob/master/FEW_RF.ipynb

I think I am missing something on how to use this tool but no idea what. I am trying to use this in tandem with TPOT as I am exploring feature creation GA/GP based tools. Sincerely appreciate any advice/guidance you can provide.

Sincerely,
G

add length-normalized probability of mutation

replace point mutation with a length-based probability of mutation at each node.
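
A minimal sketch, assuming programs are flat lists of nodes and that the per-node rate is 1/length (the issue does not fix the exact rate). `mutate_node` is a hypothetical callback that returns a replacement node.

```python
import random


def length_normalized_mutation(program, mutate_node, rng):
    """Mutate each node independently with probability 1/len(program)."""
    if not program:
        return []
    p = 1.0 / len(program)
    return [mutate_node(node) if rng.random() < p else node
            for node in program]
```

With this rate, one node is mutated on average regardless of program length.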

add size mediated parent selection for lexicase survival

include option to mediate lexicase survival via size of programs for each selection event.

move mad calculation outside of lexicase population loop

case statistics should just be calculated once rather than each selection event.
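
The idea can be sketched as follows, assuming an error matrix of shape (individuals, cases) where lower is better. This is a generic epsilon-lexicase sketch with a per-case MAD threshold, not FEW's implementation.

```python
import numpy as np


def mad(x, axis=0):
    """Median absolute deviation along an axis."""
    med = np.median(x, axis=axis, keepdims=True)
    return np.median(np.abs(x - med), axis=axis)


def epsilon_lexicase(errors, n_select, rng):
    """Select n_select individuals; errors has shape (n_individuals, n_cases)."""
    errors = np.asarray(errors, dtype=float)
    eps = mad(errors, axis=0)  # computed once, reused by every selection event
    selected = []
    for _ in range(n_select):
        candidates = np.arange(errors.shape[0])
        for case in rng.permutation(errors.shape[1]):
            if len(candidates) == 1:
                break
            case_err = errors[candidates, case]
            # Keep candidates within eps of the best error on this case.
            candidates = candidates[case_err <= case_err.min() + eps[case]]
        selected.append(int(rng.choice(candidates)))
    return selected
```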

GridSearchCV error

Concerning errors of the form

self.ml.named_steps = undefined
    205                   hasattr(self.ml.named_steps['ml'],'feature_importances_')):
    208                 coef = (self.ml.named_steps['ml'].coef_ if
AttributeError: 'SGDClassifier' object has no attribute 'named_steps'

when using FEW in GridSearchCV while changing the ML parameter. The pipeline object needs to be redefined in the fit method so that GridSearch can change self.ml and the pipeline gets updated.
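
A sketch of the proposed fix, assuming the wrapper keeps `ml` as a plain constructor parameter and builds the pipeline inside `fit`. `WrapperSketch` and the scaling step are hypothetical; only the step name `'ml'` is taken from the traceback above.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


class WrapperSketch(BaseEstimator):
    """Hypothetical wrapper; the pipeline is rebuilt on every fit."""

    def __init__(self, ml=None):
        self.ml = ml  # stored untouched so set_params and clone work

    def fit(self, X, y):
        base = self.ml if self.ml is not None else LogisticRegression()
        # Built here, so a changed self.ml is always picked up.
        self.pipeline_ = Pipeline([("standardize", StandardScaler()),
                                   ("ml", clone(base))])
        self.pipeline_.fit(X, y)
        return self

    def predict(self, X):
        return self.pipeline_.predict(X)


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
est = WrapperSketch(ml=LogisticRegression()).fit(X, y)
# Mimics what GridSearchCV does for each candidate: set_params, then fit.
est.set_params(ml=DecisionTreeClassifier()).fit(X, y)
print(type(est.pipeline_.named_steps["ml"]).__name__)
```

Because the pipeline is created in `fit` rather than in `__init__`, `named_steps['ml']` always reflects the current `ml` parameter.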

stall effects

track stalling in runs and act on it.
stalling occurs when the engineered features no longer improve either 1) the ML model CV score or 2) the median fitness of the features themselves.
if stalling occurs, there should be options to

exit
modify the search to capture a different part of the search space. this could be achieved by increasing the complexity of the features, increasing variation steps, or lowering selection pressure.
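
A minimal sketch of the tracking part, assuming higher scores are better and that a run counts as stalled after `max_stall` generations without improvement. `StallTracker` and `tol` are hypothetical names; what to do once stalled (exit or modify the search) is left to the caller.

```python
class StallTracker:
    """Counts consecutive generations without improvement."""

    def __init__(self, max_stall, tol=0.0):
        self.max_stall = max_stall
        self.tol = tol
        self.best = None
        self.stall_count = 0

    def update(self, score):
        """Record one generation's score; return True once stalled."""
        if self.best is None or score > self.best + self.tol:
            self.best = score
            self.stall_count = 0
        else:
            self.stall_count += 1
        return self.stall_count >= self.max_stall
```

The same tracker could be fed either the ML model CV score or the median feature fitness (negated if lower is better).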

ImportError: dlopen ... symbol not found

Hi, I've cloned few, built and installed on OS X 10.12 using:

CC=gcc-7 python setup.py install

But I'm getting a symbol not found error on import of the few module.

I note a few warnings during the build process beginning with: #warning "Using deprecated NumPy API, disable it by ...

and then finally:

g++ -bundle -undefined dynamic_lookup -L/Users/robertreynolds/anaconda3/envs/ml/lib -arch x86_64 -L/Users/robertreynolds/anaconda3/envs/ml/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.6/few/lib/few_lib.o -o build/lib.macosx-10.7-x86_64-3.6/few_lib.cpython-36m-darwin.so
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]

Any advice what to check next?
Otherwise, I'm not entirely clear on why I'm seeing a clang message, so that, along with the indicated warning, is my first avenue to explore.

random numbers seed not working?

Greetings!

I have the following code:

feats_gen = FEW(
                ml=DecisionTreeClassifier(random_state=10, max_depth=None, min_samples_leaf=5), 
                population_size=100, tourn_size=2,                 
                mutation_rate=0.5, crossover_rate=0.5, 
                sel='epsilon_lexicase',   
                clean=True,                
                mdr=True, boolean=True, 
                random_state=10, verbosity=1, 
                scoring_function=roc_auc_score, 
                max_depth=10, min_depth=1, max_depth_init=1, 
                classification=True, 
                generations=50, max_stall=None, 
                names=list(X_train.select_dtypes(include=[np.number]).columns))

feats_gen.fit(X_train.select_dtypes(include=[np.number]).values, 
              y_train.astype(int).values)

test_ = preprocessing_pipeline.transform(e.test)

X_test = test_.X
y_test = test_[test_.target_name].astype(int)

roc_auc_score(y_test, feats_gen._best_estimator.predict_proba(feats_gen.transform(X_test.select_dtypes(include=[np.number]).values))[:, 1])

Every time I run this code, I get different ROC AUC values in both training and test. I'm pretty sure preprocessing_pipeline is deterministic.

implement 3-fold cross validation for internal updating of best model

currently the training data is split into training and validation sets, and the best model is updated when a model with a higher validation score is found. we could simplify quite a bit and get a more robust validation measure by replacing train_test_split, and the associated numpy arrays and fitting/predicting code, with a direct call to cross_val_score(self.ml,features,labels,cv=3) or cross_val_score(self.ml,self.X[self.valid_loc(),:].transpose(),labels,cv=3).
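
A sketch of the proposed update rule, assuming higher scores are better. `validation_score` and `update_best` are hypothetical helpers; only the `cross_val_score` call comes from the text above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def validation_score(ml, features, labels, cv=3):
    """Mean k-fold CV score of ml on the candidate features."""
    return float(np.mean(cross_val_score(ml, features, labels, cv=cv)))


def update_best(best_score, ml, features, labels):
    """Return (best score so far, whether this candidate improved it)."""
    score = validation_score(ml, features, labels)
    if best_score is None or score > best_score:
        return score, True
    return best_score, False
```

`cross_val_score` clones the estimator for each fold, so `ml` itself is left unfitted and no separate validation arrays need to be kept.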

value error in lasso

occasional error:
File "/home/bill/anaconda3/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/home/bill/anaconda3/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/media/bill/Drive/Dropbox/PostDoc/code/few/few/few.py", line 506, in <module>
main()
File "/media/bill/Drive/Dropbox/PostDoc/code/few/few/few.py", line 495, in main
learner.fit(training_features, training_labels)
File "/media/bill/Drive/Dropbox/PostDoc/code/few/few/few.py", line 181, in fit
self.ml.fit(pop.X.transpose(),y_t)
File "/home/bill/anaconda3/lib/python3.5/site-packages/sklearn/linear_model/least_angle.py", line 1132, in fit
Lars.fit(self, X, y)
File "/home/bill/anaconda3/lib/python3.5/site-packages/sklearn/linear_model/least_angle.py", line 671, in fit
return_n_iter=True, positive=self.positive)
File "/home/bill/anaconda3/lib/python3.5/site-packages/sklearn/linear_model/least_angle.py", line 260, in lars_path
sign_active[n_active] = np.sign(C)
ValueError: cannot convert float NaN to integer

this should not occur, since the operators are supposed to produce safe outputs.
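
For illustration, a protected operator and a defensive check before fitting could look like the sketch below. The convention of returning 1.0 for near-zero denominators is a common genetic programming choice and an assumption here, not necessarily what FEW does. Both function names are hypothetical.

```python
import numpy as np


def safe_divide(a, b, eps=1e-9):
    """Protected division: yields 1.0 wherever |b| is too small."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    out = np.ones(np.broadcast(a, b).shape)
    # Only divide where the denominator is safely away from zero.
    np.divide(a, b, out=out, where=np.abs(b) > eps)
    return out


def sanitize(features):
    """Replace NaN and +/-inf with 0.0 before passing features to the ML model."""
    features = np.asarray(features, dtype=float)
    return np.where(np.isfinite(features), features, 0.0)
```

Checking `np.isfinite` on the feature matrix just before `self.ml.fit` would also help locate which operator let a NaN through.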

few.model() and few.print_model()

Hello!

Thanks for sharing your work, this is really cool!

I was wondering if you could provide a bit of explanation as to the difference between these two outputs of the algorithm.

Also, is there any (outside) documentation on all this?

Thanks in advance!

Kind regards,
Theodore.