rasbt / mlxtend


A library of extension and helper modules for Python's data analysis and machine learning libraries.

Home Page: https://rasbt.github.io/mlxtend/

License: Other

Python 99.70% TeX 0.30%

python machine-learning data-science data-mining association-rules supervised-learning unsupervised-learning

mlxtend's Issues

Sequential Feature Selection and GridSearch : access the indices of the best features

Hello,

This is a question.

I am using example 7 in:

http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/

I would like to access the indices of the selected features. When I use:

sfs1.k_feature_idx_

as shown in example 1, I get the error:

AttributeError                            Traceback (most recent call last)
<ipython-input-76-92f3d7e599b6> in <module>()
----> 1 sfs1.k_feature_idx_

AttributeError: 'SequentialFeatureSelector' object has no attribute 'k_feature_idx_'

I also tried to use:

gs.transform()

but got the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-102-98472ed5822d> in <module>()
----> 1 gs.transform()

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/utils/metaestimators.py in __get__(self, obj, type)
     33             # delegate only on instances, not the classes.
     34             # this is to allow access to the docstrings.
---> 35             self.get_attribute(obj)
     36         # lambda, but not partial, allows help() to work with update_wrapper
     37         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/utils/metaestimators.py in __get__(self, obj, type)
     33             # delegate only on instances, not the classes.
     34             # this is to allow access to the docstrings.
---> 35             self.get_attribute(obj)
     36         # lambda, but not partial, allows help() to work with update_wrapper
     37         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)

AttributeError: 'KNeighborsClassifier' object has no attribute 'transform'

So how do I access this information without having to search for the features again?

TIA

P.S.: I requested membership to the Google group, but I have not had a reply and it does not seem to be used.

Probably a good idea to move to py.test or plain unittest

I've been a nose user for a long, long time, and yeah, I know it's been kind of "abandoned" for some time. However, I am also a big fan of the popular saying "If it ain't broke, don't fix it," since switching can be a time sink at the expense of more interesting, valuable, and useful improvements elsewhere. On the other hand, as long as this project is rather "small" and as long as a transition (from nose to py.test) is still "reasonably doable", I think it's better to switch over to py.test sooner rather than later ...

The reason why I bring this up now is that I just stumbled upon today's release of py.test 3.0. As a reflex, I checked the current status of nose ... unfortunately, it still says:

Nose has been in maintenance mode for the past several years and will likely cease without a new person/team to take over maintainership. New projects should consider using Nose2, py.test, or just plain unittest/unittest2.

X_highlight not working

Hi, I am new to this library and have tried everything, but nothing seems to make this work. I even copied example 4 directly into my notebook, and it still does not highlight the samples.
I am using Python 3.6, and this is the installed mlxtend version:

➜  Downloads python3.6 -m pip show mlxtend
Name: mlxtend
Version: 0.5.1
Summary: Machine Learning Library Extensions
Home-page: https://github.com/rasbt/mlxtend
Author: Sebastian Raschka
Author-email: [email protected]
License: BSD 3-Clause
Location: /usr/local/lib/python3.6/site-packages
Requires: numpy, scipy

This is the example

from mlxtend.plotting import plot_decision_regions
from mlxtend.preprocessing import shuffle_arrays_unison
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.svm import SVC


# Loading some example data
iris = datasets.load_iris()
X, y = iris.data[:, [0,2]], iris.target
X, y = shuffle_arrays_unison(arrays=[X, y], random_seed=3)

X_train, y_train = X[:100], y[:100]
X_test, y_test = X[100:], y[100:]

# Training a classifier
svm = SVC(C=0.5, kernel='linear')
svm.fit(X_train, y_train)

# Plotting decision regions
plot_decision_regions(X, y, clf=svm, res=0.02, 
                      legend=2, X_highlight=X_test)

# Adding axes annotations
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.title('SVM on Iris')
plt.show()

verbose on EnsembleClassifier

Hi, are there any plans to include a verbose option? I modified the code just to know which clf is being trained, but that is far from ideal.

Feature request for sequential feature selector: random subsets at each step

Hello,

First of all: thanks for the great package! I have gotten a lot of good use out of it, especially the sequential feature selection.

SFS becomes problematic as the number of features d increases, since the complexity grows as O(d^2). I have found that one way to deal with this is to take a random subset of the remaining dimensions to check at each step instead of trying all of them. If the random subset has size k then the complexity goes down to O(dk).

Take an example of sequential forward selection with d=1000 and k=25.

During the first step, we can either try all 1000 univariate models or pick a random subset of 25 univariate models, and then take the best of them. It makes sense to try them all so as to start with a good baseline.

During the second step, instead of trying 999 bivariate models, we try only 25 of them.

Then 25 instead of 998 trivariate models. And so on until we have 25 left, at which point we revert to trying them all.
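
The procedure described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not mlxtend's implementation: the function name, its parameters, and the toy scoring function are all made up for the example, and a real score would come from cross-validating a model on the candidate subset.

```python
import random


def random_subset_forward_selection(n_features, k_features, subset_size,
                                    score_fn, seed=0):
    """Greedy forward selection that scores only a random subset of the
    remaining features at each step (hypothetical helper, not mlxtend API)."""
    rng = random.Random(seed)
    selected = []
    remaining = list(range(n_features))
    while len(selected) < k_features and remaining:
        # Try every candidate at the first step, or once few candidates remain
        if not selected or len(remaining) <= subset_size:
            candidates = list(remaining)
        else:
            candidates = rng.sample(remaining, subset_size)
        best = max(candidates, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy score (made up): subsets with low feature indices score higher
def toy_score(subset):
    return -sum(subset)


chosen = random_subset_forward_selection(
    n_features=1000, k_features=5, subset_size=25, score_fn=toy_score)
```

With d=1000 and k=25, every step after the first evaluates 25 candidates instead of roughly 1000, which is where the O(dk) cost comes from.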

If you're interested in some empirical results, I wrote a blog post about this a while back: http://blog.explainmydata.com/2012/07/speeding-up-greedy-feature-selection.html

This would be a great feature to have!

TensorFlow variable initialization will be deprecated in March 2017

All the tf.initialize_all_variables() calls need to be changed to tf.global_variables_initializer() by March 2017 to maintain compatibility with tensorflow.
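
One possible way to handle the transition is a small helper that prefers the new function and falls back to the old one. The sketch below is an assumption about how that could look, not code from mlxtend; the helper name is hypothetical, and the two SimpleNamespace objects are stand-ins for old and new TensorFlow modules so that the example runs without TensorFlow installed.

```python
from types import SimpleNamespace


def get_variable_initializer(tf_module):
    """Return the initializer factory available on `tf_module`, preferring
    the newer name (hypothetical helper)."""
    new_api = getattr(tf_module, "global_variables_initializer", None)
    if new_api is not None:
        return new_api
    return tf_module.initialize_all_variables


# Stand-ins for an old and a new TensorFlow module (not the real library)
old_tf = SimpleNamespace(initialize_all_variables=lambda: "old-init-op")
new_tf = SimpleNamespace(
    initialize_all_variables=lambda: "old-init-op",
    global_variables_initializer=lambda: "new-init-op",
)

init_old = get_variable_initializer(old_tf)()
init_new = get_variable_initializer(new_tf)()
```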

Make sure mlxtend is scikit-learn 0.18 compatible

Add continuous integration tests for scikit-learn 0.18 and ensure that mlxtend is compatible with this recently released version of scikit-learn.

verbose=1 throws AttributeError: 'StackingClassifier' object has no attribute 'clf_'

First of all thank you Sebastian for the great work on this library.

Here's my code:

clf1 = RandomForestClassifier(n_estimators=10, n_jobs=-1, criterion='gini',random_state=42)

clf2 = RandomForestClassifier(n_estimators=10, n_jobs=-1, criterion='entropy',random_state=42)

clf3 = ExtraTreesClassifier(n_estimators=10,random_state=42,n_jobs=-1)

clf4 = ExtraTreesClassifier(n_estimators=10, n_jobs=-1, criterion='entropy',random_state=42)

clf5 = GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, 
max_depth=6, n_estimators=10,random_state=42)

lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1,clf2,clf3,clf4,clf5], 
                          meta_classifier=lr,use_probas=True)

print('3-fold cross validation:\n')

for clf, label in zip([clf1,clf2,clf3,clf4,clf5,sclf], 
                      ['RF Gini', 
                       'RF Entropy',
                       'Xtra Tree 1',
                       'Xtra tree 2',
                       'Grad Boost',
                       'StackingClassifier']):

    scores = cross_validation.cross_val_score(clf, train_no_events, y, 
                                              cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" 
          % (scores.mean(), scores.std(), label))

It all goes well like this but if I add verbose=1 to StackingClassifier() I get a traceback:

3-fold cross validation:

Accuracy: 0.63 (+/- 0.01) [RF Gini]
Accuracy: 0.63 (+/- 0.01) [RF Entropy]
Accuracy: 0.63 (+/- 0.01) [Xtra Tree 1]
Accuracy: 0.63 (+/- 0.01) [Xtra tree 2]
Accuracy: 0.64 (+/- 0.00) [Grad Boost]
Fitting 5 classifiers...
Traceback (most recent call last):

  File "<ipython-input-8-29daf2b04330>", line 21, in <module>
    cv=3, scoring='accuracy')

  File "C:\Anaconda3\lib\site-packages\sklearn\cross_validation.py", line 1433, in cross_val_score
    for train, test in cv)

  File "C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):

  File "C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 658, in dispatch_one_batch
    self._dispatch(tasks)

  File "C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 566, in _dispatch
    job = ImmediateComputeBatch(batch)

  File "C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 180, in __init__
    self.results = batch()

  File "C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]

  File "C:\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]

  File "C:\Anaconda3\lib\site-packages\sklearn\cross_validation.py", line 1531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)

  File "C:\Anaconda3\envs\devel\lib\site-packages\mlxtend\classifier\stacking_classification.py", line 95, in fit
    (i, _name_estimators((clf,))[0][0], i, len(self.clf_)))

AttributeError: 'StackingClassifier' object has no attribute 'clf_'

Thanks!

Take the best performance during a Sequential Feature Selector with Pipeline process

Hi Sebastian,

I am posting this issue after our exchange on Twitter (I hope it can help other people).

I would like to perform a Sequential Feature Selector (SFS) with Pipeline. But at the end of the process, SFS takes SFS.k_features (25 for this example):

clf1 = LogisticRegression(class_weight='balanced', solver='newton-cg', C=100.0, random_state=17)

sfs1 = SFS(clf1, 
           k_features=25, 
           forward=True, 
           floating=False, 
           scoring='roc_auc',
           cv=5)
sfs1 = sfs1.fit(data.values, y.values)

clf1_pipe = Pipeline([('sfs1', sfs1),
                      ('Logistic Newton', clf1)])

print(clf1_pipe.named_steps['sfs1'].k_feature_idx_)
# (0, 1, 3, 4, 5, 9, 10, 11, 12, 13, 14, 15, 16, 17, 20, 21, 22, 23, 25, 27, 29, 30, 31, 34, 35)

The score clf1_pipe.named_steps['sfs1'].k_score_ is 0.6956081, but it is not the best score (performance) we got. In fact, we get a better score with 10 features:

result_clf1_pipe = pd.DataFrame.from_dict(clf1_pipe.named_steps['sfs1'].get_metric_dict(confidence_interval=0.90)).T
result_clf1_pipe.sort_values('avg_score', ascending=0, inplace=True)
result_clf1_pipe.head()

Can we automatically get the feature selection corresponding to the best performance during the Pipeline process?
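
As a manual workaround, the best subset could be picked from the metric dictionary after fitting. The sketch below assumes a dictionary shaped like the output of get_metric_dict(), keyed by the number of features, with 'feature_idx' and 'avg_score' entries. The helper name is hypothetical and the numbers are made up for illustration; they are not the results from this issue.

```python
def best_feature_subset(metric_dict):
    """Return (k, feature_idx, avg_score) for the entry with the highest
    avg_score (hypothetical helper, not part of mlxtend)."""
    best_k = max(metric_dict, key=lambda k: metric_dict[k]["avg_score"])
    entry = metric_dict[best_k]
    return best_k, tuple(entry["feature_idx"]), entry["avg_score"]


# Made-up numbers for illustration only
example = {
    1: {"feature_idx": (3,), "avg_score": 0.61},
    2: {"feature_idx": (3, 10), "avg_score": 0.70},
    3: {"feature_idx": (3, 10, 12), "avg_score": 0.68},
}
k, idx, score = best_feature_subset(example)
```

The returned index tuple could then be used to slice the feature matrix before fitting the final estimator.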

You can find the notebook with the pipeline process SFS ("Using Pipieline to do it").

Edit:
I manually searched for the best number of k_features for SFS for all my estimators. Then I plugged them into an EnsembleVoteClassifier. The result is not what I expected (see: "Find Manually the best k_features for SFS and fit our ensemble ")

Use .csv and csv.gz files instead of .py files for fetching the dataset

It would be a good idea to move the datasets out of the .py files and create a universal fetch function that can be used within the original loading functions. All the CSV files are quite small except for the MNIST subset, which I would compress into a .csv.gz.
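
A universal fetch function along these lines could be sketched with the standard library alone. The function name and the assumption that every column is numeric are mine, for illustration; the actual loaders may need to handle headers or label columns differently.

```python
import csv
import gzip


def load_csv_dataset(path, delimiter=","):
    """Read a .csv or .csv.gz file into a list of rows of floats
    (hypothetical helper; assumes purely numeric columns and no header)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return [[float(value) for value in row]
                for row in csv.reader(handle, delimiter=delimiter) if row]
```

The original loading functions could then call this helper with the path of their own data file, compressed or not.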

Make mlxtend available through conda-forge

conda install -c rasbt mlxtend
would install the old version of mlxtend. It would be nice to have the new version available through conda.

plot_decision_regions: bug with the expected shape of X_highlight

If you pass an array of points with shape [n_samples, n_features], as stated in the docstring, a ValueError('X_highlight must be a 2D array') is raised.

StackingCVClassifier

Hello.
Don't you think that your implementation of stacking, where you fit the level-1 classifiers on the whole X_train and then predict labels (or probabilities) with them on the same X_train, is not really good? It may lead to strong overfitting of these predicted labels.
I think a better approach is to make folds and train the classifiers something like this: you train on a (1 - 1/n_folds) part of the training set, then predict labels for the rest of X_train, and repeat this for all folds. When you finish, you have labels for the whole X_train without overfitting, and you can then fit every classifier one more time on the whole X_train to predict the X_test labels:

    def fit(self, X_train, y_train):
        self.n_examples = X_train.shape[0]
        self.n_classes = len(set(y_train))
        self.folds = KFold(n=self.n_examples, n_folds=self.n_folds)

        clfs_preds = np.array([]).reshape(self.n_examples, 0)

        for clf in self.clfs:
            clf_ = clone(clf)
            clf_preds = np.array([]).reshape(0, self.n_classes*self.probas + 1 * (not self.probas))
            for train, pred in self.folds:
                clf_.fit(X_train[train], y_train[train])
                if not self.probas:
                    clf_pred = clf_.predict(X_train[pred]).reshape(X_train[pred].shape[0], 1)
                else:
                    clf_pred = clf_.predict_proba(X_train[pred]).reshape(X_train[pred].shape[0], self.n_classes)
                clf_preds = np.concatenate([clf_preds, clf_pred], axis=0)

            clfs_preds = np.concatenate([clfs_preds, clf_preds], axis=1)
        # fitting the clfs to predict X_test labels or probabilities when fitting the meta-classifier
        for clf in self.clfs:
            clf.fit(X_train, y_train)

        # add clfs predictions to X_train table or not
        if self.append_preds:
            X_train_ext = np.concatenate([X_train, clfs_preds], axis=1)
        else:
            X_train_ext = clfs_preds
        self.meta_clf.fit(X_train_ext, y_train)
        return self

    def predict(self, X_test):

        clfs_preds = np.array([]).reshape(X_test.shape[0], 0)
        for clf in self.clfs:
            if not self.probas:
                clf_pred = clf.predict(X_test).reshape(X_test.shape[0], 1)
            else:
                clf_pred = clf.predict_proba(X_test).reshape(X_test.shape[0], self.n_classes)
            clfs_preds = np.concatenate([clfs_preds, clf_pred], axis=1)

        # add clfs predictions to X_test table or not
        if self.append_preds:
            X_test_ext = np.concatenate([X_test, clfs_preds], axis=1)
        else:
            X_test_ext = clfs_preds

        return self.meta_clf.predict(X_test_ext)

Add separate development documentation

Add a separate documentation for development version as http://rasbt.github.io/mlxtend/dev

Inconsistent Travis issues with certain unit tests

Once in a while, e.g., every 10th time, Travis seems to complain about a certain unit test of the multi-layer perceptron. I couldn't find out what causes the problem, and I can't reproduce the issue on my local machine. Also, it works just fine via Travis ~90% of the time. I am wondering if it is related to the random state being set incorrectly, but since the random state is set equally for all classifiers inside the fit method that is imported from _BaseSupervisedEstimator, I don't think this is the issue here. I think it may be something hardware-specific.

I was wondering if someone has an idea of what could be going on here? Would really appreciate any ideas and feedback!

As a current workaround for this problem, I changed the unit test to the following:

def test_multiclass_gd_acc():
    mlp = MLP(epochs=20,
              eta=0.05,
              hidden_layers=[10],
              minibatches=1,
              random_seed=1)
    mlp.fit(X, y)
    assert round(mlp.cost_[0], 2) == 0.55, mlp.cost_[0]

    if round(mlp.cost_[-1], 2) == 0.25:
        warnings.warn('About 10% of the time, mlp.cost_[-1] is'
                      ' 0.247213137424 when tested via Travis CI.'
                      ' Likely, it is an architecture-related problem but'
                      ' should be looked into in future.')
    else:
        assert round(mlp.cost_[-1], 2) == 0.01, mlp.cost_[-1]
        assert (y == mlp.predict(X)).all()

Ability to install mlxtend with conda

Hi,

I really love this package and the documentation, it's been very helpful for a machine learning beginner like myself. I am wondering if you would consider putting this package onto the Anaconda Cloud so that it can be downloaded with conda in addition to pip, like so:

conda install mlxtend

The benefit of doing this is that it would make things easier for Windows users who install Miniconda (vs. Anaconda, which comes with a bunch of scientific computing packages by default), because some of this package's dependencies (scipy, numpy, matplotlib) can be difficult to install on Windows via pip.

Personally I am fine using pip, but I often share the code I write on Ubuntu with co-workers who use Windows, and for them using conda makes their lives easier.

This is a very minor issue, but I hope it will make access to this package easier.

Thanks!

Use futurepast to deprecate plotting functions from (0.4.2 -> 0.4.3)

There seems to be a handy package that makes deprecation warnings more convenient. Let's use it from now on.

https://github.com/amueller/futurepast
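
For comparison, here is what a deprecation helper could look like using only the standard warnings module. This is a generic sketch and does not reproduce the futurepast API; the decorator name and the decorated stand-in function are hypothetical.

```python
import functools
import warnings


def deprecated(replacement):
    """Mark a function as deprecated and point callers to `replacement`
    (hypothetical decorator, not the futurepast API)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                "%s is deprecated; use %s instead."
                % (func.__name__, replacement),
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated("mlxtend.plotting.plot_decision_regions")
def old_plot_decision_regions(*args, **kwargs):
    # Stand-in for a plotting function kept at its old import location
    return "plotted"
```

Calling the old function still works but emits a DeprecationWarning that names the new location.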

moving stuff upstream

Hey, do you want to move some of this stuff upstream?
I wanted to have a general decision region / decision boundary function because it is reimplemented in so many examples.

Add Dropout to TfMultiLayerPerceptron

Add a custom scorer for the LIFT metric

build a scikit-learn compatible custom scorer for computing the LIFT metric
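
A minimal sketch of the metric itself, assuming the common definition of lift as precision divided by the base rate of the positive class. The function name and signature are assumptions for illustration. A function of this form, taking y_true and y_pred, could presumably be wrapped with sklearn.metrics.make_scorer to obtain a scorer object.

```python
def lift_score(y_true, y_pred, positive_label=1):
    """Lift = precision / base rate of the positive class
    (assumed definition; hypothetical function)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n_samples = len(y_true)
    true_pos = sum(1 for t, p in zip(y_true, y_pred)
                   if t == positive_label and p == positive_label)
    pred_pos = sum(1 for p in y_pred if p == positive_label)
    actual_pos = sum(1 for t in y_true if t == positive_label)
    if pred_pos == 0 or actual_pos == 0:
        raise ValueError("lift is undefined without predicted "
                         "and actual positives")
    precision = true_pos / pred_pos
    base_rate = actual_pos / n_samples
    return precision / base_rate


# Made-up labels: 6 of 10 samples are positive, 2 of 3 predicted positives are correct
y_true = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
lift = lift_score(y_true, y_pred)
```

A lift above 1 means the predicted positives contain a higher share of true positives than the dataset as a whole.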

Use all probabilities in StackingClassifier instead of averaging

Use all level-1 class probabilities to train the second level classifier instead of averaging probabilities for each classifier, which should provide more information to the second-level classifier (as suggested in #29)

Paper Reference: Ting, Kai Ming, and Ian H. Witten. "Issues in stacked generalization." J. Artif. Intell. Res.(JAIR) 10 (1999): 271-289.
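
The difference between the two options is easiest to see in the shape of the meta-features. The sketch below uses made-up probabilities from two hypothetical level-1 classifiers and is not taken from the StackingClassifier code.

```python
import numpy as np

# Made-up class probabilities from two level-1 classifiers:
# 2 samples x 3 classes each
proba_clf1 = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])
proba_clf2 = np.array([[0.5, 0.4, 0.1],
                       [0.2, 0.2, 0.6]])

# Averaging collapses the classifiers into one block of n_classes columns
averaged = np.mean([proba_clf1, proba_clf2], axis=0)

# Stacking side by side keeps n_classifiers * n_classes columns,
# so the second-level classifier sees each classifier's output separately
stacked = np.hstack([proba_clf1, proba_clf2])
```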

Decision region plot for more than 4 classes

I am not 100% sure if this is a bug in matplotlib, but I would appreciate any help or insights. There may be a work-around for solving the problem below, but I haven't found it yet. (Also see this matplotlib discussion as reference.)

The following code -- plotting 4 decision regions -- works just fine (1, 2, and 3 classes work fine as well):

(Note that the code below is a simplified version of the plot_decision_regions plot in mlxtend. If we can fix this simple issue below -- I think it is related to the ListedColormap -- it can be directly transferred to the plot_decision_regions function)

from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

def plot_decision_regions(X, y, classifier, resolution=0.1):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))+1])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=colors[idx],
                    marker=markers[idx], label=cl)


# Loading some example data
iris = datasets.load_iris()
X = iris.data[:, [0,2]]
y = iris.target
y = np.concatenate((y, np.ones(50)+2))
y = y.astype(int)
X = np.concatenate((X, X[:50]*2))

lr = LogisticRegression(solver='newton-cg', multi_class='multinomial')
lr.fit(X, y)
plot_decision_regions(X, y, classifier=lr)

However, if I add a fifth class, I get the following problem:

iris = datasets.load_iris()
X = iris.data[:, [0,2]]
y = iris.target
y = np.concatenate((y, np.ones(50)+2, np.ones(50)+3))
y = y.astype(int)
X = np.concatenate((X, X[:50]*2, X[:50]*3))

lr = LogisticRegression(solver='newton-cg', multi_class='multinomial')
lr.fit(X, y)
plot_decision_regions(X, y, classifier=lr)

add L2 regularization to TfMultiLayerPerceptron

Docs: incorrect image in confusion matrix example

Problem: the image for Example 1 is incorrect. The given array is:

array([[3, 1],
       [2, 2]])

Instead of showing the correct confusion matrix for the given array, the image displayed is a duplicate of the Example 2 image:

Source: appears to be cells run out of order in confusion_matrix.ipynb

permanently stuck due to different permutations of k_idx lists in sdq

Program will become permanently stuck if sdq contains k_idx lists that are different permutations of one another. This is easily fixed by appending k_idx as sets instead of lists.

Unfortunately I do not know of a dataset that I can use to show this example. Without a usable dataset to exploit this problem it is difficult to write a unittest for this.
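
Even without a dataset, the underlying comparison problem can be shown in isolation. The sketch below assumes that sdq is a bounded deque of recently visited feature-index collections (the maxlen value is an assumption); the index values are made up.

```python
from collections import deque

# Two index lists that describe the same feature subset in a different order
k_idx_a = [0, 2, 5]
k_idx_b = [5, 0, 2]

# As lists they compare unequal, so a "seen before?" check misses the repeat
sdq_lists = deque(maxlen=4)
sdq_lists.append(k_idx_a)
seen_as_list = k_idx_b in sdq_lists

# As sets they compare equal regardless of order
sdq_sets = deque(maxlen=4)
sdq_sets.append(set(k_idx_a))
seen_as_set = set(k_idx_b) in sdq_sets
```

A unit test could be built the same way, by feeding permuted index collections directly into the cycle check instead of searching for a dataset that triggers it.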

Segmentation fault on plotting confusion matrix!

I am trying to plot a confusion matrix of 1000 labels and save it to an image file. After using 15GB of memory, the code terminates with a segmentation fault! Any idea why?

Backport `use_features_in_secondary` to `StackingClassifier`

The StackingCVClassifier has an optional use_features_in_secondary parameter, which, if True, will feed the original features (in addition to the meta-features) to the level-2 classifier. Would be nice to add this to the StackingClassifier as well!

Travis-CI fails with "scikit-learn" missing exception

https://travis-ci.org/rasbt/mlxtend/jobs/163844960
reports
ERROR: Failure: ImportError (No module named 'sklearn.exceptions')

while "sklearn.exceptions" is present in scikit-learn>=0.17, which is specified in requirements.txt.
Please tell me if I'm doing anything wrong here!

shuffle_arrays_unison: inconsistency in the random state parameter

The random state parameter in the shuffle_arrays_unison function is random_seed, but in the docstring it is named random_state. The same inconsistency exists in the docs: http://rasbt.github.io/mlxtend/user_guide/preprocessing/shuffle_arrays_unison/. I would make a PR, but I don't know which name we want to keep.

Numerically stable OLS

You might consider switching to using SVD (pseudoinverse) or QR decomposition for OLS. For SVD you could update after seeing new data. R's lm uses QR by default for some nice features you get for ANOVA after.

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/linear_model.py#L185

Translating from weird statsmodels notation

For SVD

beta = np.linalg.pinv(X).dot(y)

For QR

Q, R = np.linalg.qr(X)
# for ANOVA
effects = np.dot(Q.T, y)
beta = np.linalg.solve(R, effects)
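
To make the comparison concrete, here is a small self-contained check of the three approaches on synthetic data. The data and coefficient values are made up; with noise-free targets and a well-conditioned X, all three solvers should recover the same coefficients, and the differences only show up when X is ill-conditioned.

```python
import numpy as np

rng = np.random.RandomState(0)
# Design matrix with an intercept column and 2 random features
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
true_beta = np.array([1.0, 2.0, -3.0])
y = X.dot(true_beta)  # noise-free targets so all solvers should agree

# Normal equations: forms X^T X, which squares the condition number of X
beta_normal = np.linalg.solve(X.T.dot(X), X.T.dot(y))

# SVD-based pseudoinverse
beta_svd = np.linalg.pinv(X).dot(y)

# QR decomposition: solve R beta = Q^T y
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T.dot(y))
```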

EnsembleRegressor

I opened this issue as a follow up on a discussion with

@nikhilRP:

Can we have something like "VotingClassifier" for regression as well?

I believe the same concept would also work for regression from a technical perspective. However, I am not sure how "good" or "practical" this would be. I think one would have to be especially careful with the averaging if there is a large gap between the predicted targets; e.g., if we have 3 regressors with predicted targets of 1.1, 1.3, and 1031.4, averaging could yield some strange behavior. Maybe it would be worthwhile to include some optional outlier detection. Here are some examples:

1. for < 10 regressors, use Dixon's Q test to remove lower-end outliers
2. for >= 10 regressors, maybe exclude everything that's outside the IQR (if the predicted scores are represented as a "box plot")
3. as an alternative to 1. and 2., use sequential feature selection to select a k-sized subset of regressors so that the performance of the average fit is "optimal."
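
To illustrate the averaging problem with the three predicted targets mentioned above, here is a small sketch using only the standard library. The helper name is hypothetical, and it computes the Q statistic for the largest value only. Deciding whether to reject the value would additionally require a table of critical Q values, which is left out here.

```python
import statistics


def dixon_q_high(values):
    """Dixon's Q statistic for the largest value: the gap to its nearest
    neighbour divided by the full range (hypothetical helper)."""
    ordered = sorted(values)
    value_range = ordered[-1] - ordered[0]
    if value_range == 0:
        return 0.0
    return (ordered[-1] - ordered[-2]) / value_range


predictions = [1.1, 1.3, 1031.4]  # the three predicted targets from the text
plain_mean = statistics.mean(predictions)
median = statistics.median(predictions)
q_stat = dixon_q_high(predictions)
```

The plain mean is pulled far away from the two regressors that agree, while the median is not, and the Q statistic is close to 1, which flags the largest prediction as a likely outlier.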

I believe that an implementation would be pretty straight-forward from a technical perspective using the Ensemble Classifier as a template.

In addition, an interesting idea by @nikhilRP was

[to] use r-squared scores (or any other measure) to determine weights of each predicted targets.

Honestly, I haven't really thought about it yet, and I suggest that we brainstorm a bit more. However, in order to get started, we could already prepare a scaffold EnsembleRegressor that takes scikit-learn's regressors as input estimators and outputs the average fits (similar to the EnsembleClassifier's predict_proba). :)

Note

mlxtend is currently being slightly overhauled (see branch https://github.com/rasbt/mlxtend/tree/03).

The mlxtend.classifier.EnsembleClassifier from the mlxtend.classifier.ensemble.py module has been moved to the mlxtend.classifier.ensemble_vote.py module where it is now mlxtend.classifier.EnsembleVoteClassifier. The reason is that the new "description" is more specific and would help distinguish it from a future mlxtend.classifier.EnsembleStackingClassifier etc.
Furthermore, the regression subpackage has been split into 2 new subpackages, regressor and regression_utils -- the former contains classes (estimators) for fitting regression models, and the latter contains auxiliary functions such as visualization tools etc.

I am preparing the new documentation and website right now, and the new release (v. 0.3) will probably be ready next week and ready to be merged into the current master branch.

Thus, an EnsembleRegressor would best fit into the mlxtend.regressor subpackage.

Set up AppVeyor

does not seem to install on Ubuntu 16.04 x64

Hi.

Below are the commands that I ran on a clean install of Ubuntu 16.04 / x64 (on DigitalOcean).

Preparation:

root@delme:~# python3 --version
Python 3.5.1+

apt -y install python3-pip

root@delme:~# pip3 --version
pip 8.1.1 from /usr/lib/python3/dist-packages (python 3.5)

Pip3 install (note 'Killed')

root@delme:~# pip3 install mlxtend
Collecting mlxtend
  Downloading mlxtend-0.4.1.tar.gz (1.1MB)
    100% |████████████████████████████████| 1.1MB 451kB/s 
Building wheels for collected packages: mlxtend
  Running setup.py bdist_wheel for mlxtend ... done
  Stored in directory: /root/.cache/pip/wheels/fb/59/f3/284c574e254e2e619a93e76ec9644def550940c13ac9fe8576
Successfully built mlxtend
Installing collected packages: mlxtend
Killed
root@delme:~#

Install from github

pip3 install git+git://github.com/rasbt/mlxtend.git#egg=mlxtend

root@delme:~# pip3 install git+git://github.com/rasbt/mlxtend.git#egg=mlxtend
Collecting mlxtend from git+git://github.com/rasbt/mlxtend.git#egg=mlxtend
  Cloning git://github.com/rasbt/mlxtend.git to /tmp/pip-build-f0aeav6_/mlxtend
Installing collected packages: mlxtend
  Running setup.py install for mlxtend ... error

Command "/usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-f0aeav6_/mlxtend/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-22e_cbrj-record/install-record.txt --single-version-externally-managed --compile" failed with error code -9 in /tmp/pip-build-f0aeav6_/mlxtend/

Typo error of label in example image

Hi Sebastian,

I spotted a typo in the example in the README. Please replace the Naive Bayes label with SVM.

Thank you for your wonderful package.
Laam

Move all plots to mlxtend.plotting and add deprecation warnings

To avoid requiring additional dependencies that are not necessary for the core functionality of mlxtend, I think that it would be a good idea to move and unify all plotting functions and modules into a separate subpackage, e.g., mlxtend.plotting.

EnsembleClassifier with scaled and unscaled data

Hello,
I am training and fitting data using different classifiers. For some of them (e.g., multi-layer perceptron) I scale the training and testing matrices using scikit-learn's StandardScaler(). For others (e.g., Random Forest) I do not scale the data.

I would like to use the EnsembleClassifier and was wondering how to use it with classifiers that were trained with different matrices (scaled vs unscaled).

Thanks!

Add utilities to implement deprecation warnings

'StackingClassifier' object has no attribute 'clf_'

sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], meta_classifier=lr, use_probas=True, verbose=1)

When I set use_probas=True, I get this:

AttributeError:'StackingClassifier' object has no attribute 'clf_'

what should I do?

Add class weights to the EnsembleVotingClassifier

I used a gradient boosting classifier to build a classification model. I am trying to improve the model by using a stacked model. I want to ensemble 3 different models, let's say GBM, random forests, and logistic regression (except for GBM, the other models are subject to change). In my GBM model, I used weights in the fit function, giving higher weights to the positive target variable. I want to try the same thing in the ensemble, but I am unable to figure out how to implement weights in the source code of the EnsembleVotingClassifier. I am new to this, so I would like to receive suggestions regarding the implementation of weights.

Thanks

k_features="best" not documented

In the PyPI version, the SequentialFeatureSelector has k_features="best" by default, but this option is not explained in the docstring.

Linear Regression Plotting not working

Hi, First of all ... awesome work man !!
I am new to python and after moving from R on kaggle found link to 'mlxtend'.

My issue is that the following code does not plot anything.

from mlxtend.regression import lin_regplot
import numpy as np

X = np.array([4, 8, 13, 26, 31, 10, 8, 30, 18, 12, 20, 5, 28, 18, 6, 31, 12, 12, 27, 11, 6, 14, 25, 7, 13,4, 15, 21, 15])
y = np.array([14, 24, 22, 59, 66, 25, 18, 60, 39, 32, 53, 18, 55, 41, 28, 61, 35, 36, 52, 23, 19, 25, 73, 16, 32, 14, 31, 43, 34])
intercept, slope, corr_coeff = lin_regplot(X[:,np.newaxis], y,)

Other code, such as the classifiers and the 1D, 2D, and learning curve plots, works as described on the main page.
Can you help in this regard? Did I miss something? All the code you provide seems to be straightforward!

Pipe using `SequentialFeatureSelector`: problems passing metric parameters

Hello,

I posted this question in the Google groups but it does not seem to attract any attention. So I am posting this here. If this is not correct, please tell me.

I have taken some scikit-learn source code that used the standard grid search and adapted it to use a
pipeline with the SFS. I use the "seuclidean" metric with the ball-tree algorithm, which requires a metric parameter: a variance vector. When I execute the standard scikit-learn code I have no problem. However, with the SFS in a Pipeline I get two errors:

1. If I do not provide the metric's parameters, I get (see stack trace 1):
   TypeError: __init__() takes exactly 1 positional argument (0 given)
2. If I provide the parameter, I get (see stack trace 2):
   ValueError: SEuclidean dist: size of V does not match

Error 2 is understandable: because SFS does feature selection, I cannot pre-calculate this value. It depends on the features used. I was expecting the metric parameters to be calculated automatically and therefore not to require this input. I also tried to pass None as the parameter, but with no success.
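
The size mismatch can be reproduced in isolation with a plain-Python version of the standardized Euclidean distance. This sketch only illustrates why the variance vector has to be subset together with the features; it is not how scikit-learn or SFS compute the metric internally, and all values and names are made up.

```python
import math


def seuclidean(u, v, V):
    """Standardized Euclidean distance: sqrt(sum((u_i - v_i)^2 / V_i))."""
    if not (len(u) == len(v) == len(V)):
        raise ValueError("size of V does not match")
    return math.sqrt(sum((a - b) ** 2 / var for a, b, var in zip(u, v, V)))


V_all = [4.0, 1.0, 9.0]   # made-up per-feature variances for 3 features
u, v = [2.0, 0.0, 3.0], [0.0, 1.0, 0.0]
selected = (0, 2)         # hypothetical feature subset chosen by the selector

# Subset the points and the variance vector with the same indices
u_sel = [u[i] for i in selected]
v_sel = [v[i] for i in selected]
V_sel = [V_all[i] for i in selected]
distance = seuclidean(u_sel, v_sel, V_sel)
```

Passing the full V_all together with the two selected features raises the same kind of ValueError as in stack trace 2, because V was computed for all features up front.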

Can anyone shed light on how I should proceed? I have added my code below in case this helps
(data sets managed with Pandas).

TIA,
Hugo

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn import preprocessing

# get the unnormalized data
X = dy[ dy.columns.difference(['label']).values ]
y = dy['label'].values                           

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

V = X_train.var().values
C = X_train.cov().values
CPI = np.linalg.pinv(C)
CI = np.linalg.inv(C)

# k_range : must be less than the training size. What happens if number of features > sample size
k_range    = range(1, len(X.columns))
weights    = ['uniform' , 'distance']
#algos_all  = ['auto', 'ball_tree', 'kd_tree', 'brute']
algos_all  = ['ball_tree', 'kd_tree', 'brute']
algos      = ['brute', 'kd_tree']
leaf_sizes = range(5, 60, 10)   
metrics = ["euclidean", "manhattan", "chebyshev", "minkowski"]


# Metric can only be used with certain algorithms
# Metrics intended for real-valued vector spaces:

seuclidean = {
    'sfs__k_features'              : list(range(1,len(X.columns))),
    'sfs__estimator__metric'       : ['seuclidean'],
    'sfs__estimator__metric_params': [ {'V':V} ],  # will be automatically calculated
    'sfs__estimator__algorithm'    : ['ball_tree'],  # TODO , ['brute', 'ball_tree'],
    'sfs__estimator__n_neighbors'  : list(k_range),
    'sfs__estimator__weights'      : weights,
    'sfs__estimator__leaf_size'    : list(leaf_sizes) }

from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import mlxtend

# Instantiate the algorithm
knn = KNeighborsClassifier(n_neighbors=10)
#print(knn.get_params().keys())

sfs1 = SFS(estimator=knn,
           k_features=3,
           forward=True,
           floating=False,
           scoring='accuracy',
           print_progress=False,
           cv=5)
           # !?!? n_jobs=-1)

pipe = Pipeline([
                 ('standardize', preprocessing.MinMaxScaler()),
                 ('sfs', sfs1),
                 ('knn', knn)])

# See KNeighborsClassifier equivalent param_grid
param_grid = [
    seuclidean
  ]

# Instantiate the grid search
gs = GridSearchCV(estimator=pipe,
                  param_grid=param_grid,
                  scoring='accuracy',
                  #n_jobs=-1,  for better stack tracing
                  cv=5,
                  verbose=1,
                  refit=True)

# Run the grid search
gs = gs.fit(X_train.values, y_train)

Stack Trace 1

Fitting 5 folds for each of 1200 candidates, totalling 6000 fits

TypeError                                 Traceback (most recent call last)
<ipython-input-68-4ef553dad211> in <module>()
    167
    168 # Run the grid search
--> 169 gs = gs.fit(X_train.values, y_train)

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/grid_search.py in fit(self, X, y)
    802
    803         """
--> 804         return self._fit(X, y, ParameterGrid(self.param_grid))
    805
    806

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/grid_search.py in _fit(self, X, y, parameter_iterable)
    551                                     self.fit_params, return_parameters=True,
    552                                     error_score=self.error_score)
--> 553                 for parameters in parameter_iterable
    554                 for train, test in cv)
    555

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    798             # was dispatched. In particular this covers the edge
    799             # case of Parallel used with an exhausted iterator.
--> 800             while self.dispatch_one_batch(iterator):
    801                 self._iterating = True
    802             else:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    656                 return False
    657             else:
--> 658                 self._dispatch(tasks)
    659                 return True
    660

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    564
    565         if self._pool is None:
--> 566             job = ImmediateComputeBatch(batch)
    567             self._jobs.append(job)
    568             self.n_dispatched_batches += 1

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
    178         # Don't delay the application, to avoid keeping the input
    179         # arguments in memory
--> 180         self.results = batch()
    181
    182     def get(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1529             estimator.fit(X_train, **fit_params)
   1530         else:
-> 1531             estimator.fit(X_train, y_train, **fit_params)
   1532
   1533     except Exception as e:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    162             the pipeline.
    163         """
--> 164         Xt, fit_params = self._pre_transform(X, y, **fit_params)
    165         self.steps[-1][-1].fit(Xt, y, **fit_params)
    166         return self

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/pipeline.py in _pre_transform(self, X, y, **fit_params)
    143         for name, transform in self.steps[:-1]:
    144             if hasattr(transform, "fit_transform"):
--> 145                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    146             else:
    147                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in fit_transform(self, X, y)
    239
    240     def fit_transform(self, X, y):
--> 241         self.fit(X, y)
    242         return self.transform(X)
    243

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in fit(self, X, y)
    136                     self._inclusion(orig_set=orig_set,
    137                                     subset=prev_subset,
--> 138                                     X=X, y=y)
    139             else:
    140                 k_idx, k_score, cv_scores = \

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in _inclusion(self, orig_set, subset, X, y)
    205             for feature in remaining:
    206                 new_subset = tuple(subset | {feature})
--> 207                 cv_scores = self._calc_score(X, y, new_subset)
    208                 all_avg_scores.append(cv_scores.mean())
    209                 all_cv_scores.append(cv_scores)

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in _calc_score(self, X, y, indices)
    190                                      scoring=self.scorer,
    191                                      n_jobs=self.n_jobs,
--> 192                                      pre_dispatch=self.pre_dispatch)
    193         else:
    194             self.est_.fit(X[:, indices], y)

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
   1431                                               train, test, verbose, None,
   1432                                               fit_params)
-> 1433                       for train, test in cv)
   1434     return np.array(scores)[:, 0]
   1435

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    798             # was dispatched. In particular this covers the edge
    799             # case of Parallel used with an exhausted iterator.
--> 800             while self.dispatch_one_batch(iterator):
    801                 self._iterating = True
    802             else:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    656                 return False
    657             else:
--> 658                 self._dispatch(tasks)
    659                 return True
    660

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    564
    565         if self._pool is None:
--> 566             job = ImmediateComputeBatch(batch)
    567             self._jobs.append(job)
    568             self.n_dispatched_batches += 1

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
    178         # Don't delay the application, to avoid keeping the input
    179         # arguments in memory
--> 180         self.results = batch()
    181
    182     def get(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1529             estimator.fit(X_train, **fit_params)
   1530         else:
-> 1531             estimator.fit(X_train, y_train, **fit_params)
   1532
   1533     except Exception as e:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/neighbors/base.py in fit(self, X, y)
    801             self._y = self._y.ravel()
    802
--> 803         return self._fit(X)
    804
    805

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/neighbors/base.py in _fit(self, X)
    256             self._tree = BallTree(X, self.leaf_size,
    257                                   metric=self.effective_metric_,
--> 258                                   **self.effective_metric_params_)
    259         elif self._fit_method == 'kd_tree':
    260             self._tree = KDTree(X, self.leaf_size,

sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn/neighbors/ball_tree.c:8381)()

sklearn/neighbors/dist_metrics.pyx in sklearn.neighbors.dist_metrics.DistanceMetric.get_metric (sklearn/neighbors/dist_metrics.c:4330)()

sklearn/neighbors/dist_metrics.pyx in sklearn.neighbors.dist_metrics.SEuclideanDistance.__init__ (sklearn/neighbors/dist_metrics.c:5888)()

TypeError: __init__() takes exactly 1 positional argument (0 given)

Stack Trace 2

Fitting 5 folds for each of 1200 candidates, totalling 6000 fits

ValueError                                Traceback (most recent call last)
<ipython-input-69-558dd50887b6> in <module>()
    167
    168 # Run the grid search
--> 169 gs = gs.fit(X_train.values, y_train)

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/grid_search.py in fit(self, X, y)
    802
    803         """
--> 804         return self._fit(X, y, ParameterGrid(self.param_grid))
    805
    806

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/grid_search.py in _fit(self, X, y, parameter_iterable)
    551                                     self.fit_params, return_parameters=True,
    552                                     error_score=self.error_score)
--> 553                 for parameters in parameter_iterable
    554                 for train, test in cv)
    555

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    798             # was dispatched. In particular this covers the edge
    799             # case of Parallel used with an exhausted iterator.
--> 800             while self.dispatch_one_batch(iterator):
    801                 self._iterating = True
    802             else:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    656                 return False
    657             else:
--> 658                 self._dispatch(tasks)
    659                 return True
    660

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    564
    565         if self._pool is None:
--> 566             job = ImmediateComputeBatch(batch)
    567             self._jobs.append(job)
    568             self.n_dispatched_batches += 1

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
    178         # Don't delay the application, to avoid keeping the input
    179         # arguments in memory
--> 180         self.results = batch()
    181
    182     def get(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1529             estimator.fit(X_train, **fit_params)
   1530         else:
-> 1531             estimator.fit(X_train, y_train, **fit_params)
   1532
   1533     except Exception as e:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    162             the pipeline.
    163         """
--> 164         Xt, fit_params = self._pre_transform(X, y, **fit_params)
    165         self.steps[-1][-1].fit(Xt, y, **fit_params)
    166         return self

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/pipeline.py in _pre_transform(self, X, y, **fit_params)
    143         for name, transform in self.steps[:-1]:
    144             if hasattr(transform, "fit_transform"):
--> 145                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    146             else:
    147                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in fit_transform(self, X, y)
    239
    240     def fit_transform(self, X, y):
--> 241         self.fit(X, y)
    242         return self.transform(X)
    243

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in fit(self, X, y)
    136                     self._inclusion(orig_set=orig_set,
    137                                     subset=prev_subset,
--> 138                                     X=X, y=y)
    139             else:
    140                 k_idx, k_score, cv_scores = \

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in _inclusion(self, orig_set, subset, X, y)
    205             for feature in remaining:
    206                 new_subset = tuple(subset | {feature})
--> 207                 cv_scores = self._calc_score(X, y, new_subset)
    208                 all_avg_scores.append(cv_scores.mean())
    209                 all_cv_scores.append(cv_scores)

/home/hmf/my_py3/lib/python3.4/site-packages/mlxtend/feature_selection/sequential_feature_selector.py in _calc_score(self, X, y, indices)
    190                                      scoring=self.scorer,
    191                                      n_jobs=self.n_jobs,
--> 192                                      pre_dispatch=self.pre_dispatch)
    193         else:
    194             self.est_.fit(X[:, indices], y)

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in cross_val_score(estimator, X, y, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
   1431                                               train, test, verbose, None,
   1432                                               fit_params)
-> 1433                       for train, test in cv)
   1434     return np.array(scores)[:, 0]
   1435

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    798             # was dispatched. In particular this covers the edge
    799             # case of Parallel used with an exhausted iterator.
--> 800             while self.dispatch_one_batch(iterator):
    801                 self._iterating = True
    802             else:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    656                 return False
    657             else:
--> 658                 self._dispatch(tasks)
    659                 return True
    660

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    564
    565         if self._pool is None:
--> 566             job = ImmediateComputeBatch(batch)
    567             self._jobs.append(job)
    568             self.n_dispatched_batches += 1

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, batch)
    178         # Don't delay the application, to avoid keeping the input
    179         # arguments in memory
--> 180         self.results = batch()
    181
    182     def get(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
     70
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73
     74     def __len__(self):

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
   1529             estimator.fit(X_train, **fit_params)
   1530         else:
-> 1531             estimator.fit(X_train, y_train, **fit_params)
   1532
   1533     except Exception as e:

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/neighbors/base.py in fit(self, X, y)
    801             self._y = self._y.ravel()
    802
--> 803         return self._fit(X)
    804
    805

/home/hmf/my_py3/lib/python3.4/site-packages/sklearn/neighbors/base.py in _fit(self, X)
    256             self._tree = BallTree(X, self.leaf_size,
    257                                   metric=self.effective_metric_,
--> 258                                   **self.effective_metric_params_)
    259         elif self._fit_method == 'kd_tree':
    260             self._tree = KDTree(X, self.leaf_size,

sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn/neighbors/ball_tree.c:8793)()

sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.ball_tree.BinaryTree._recursive_build (sklearn/neighbors/ball_tree.c:10053)()

sklearn/neighbors/ball_tree.pyx in sklearn.neighbors.ball_tree.init_node (sklearn/neighbors/ball_tree.c:20030)()

sklearn/neighbors/binary_tree.pxi in sklearn.neighbors.ball_tree.BinaryTree.rdist (sklearn/neighbors/ball_tree.c:9932)()

sklearn/neighbors/dist_metrics.pyx in sklearn.neighbors.dist_metrics.SEuclideanDistance.rdist (sklearn/neighbors/dist_metrics.c:6065)()

ValueError: SEuclidean dist: size of V does not match

Installation failed on ubuntu 12.04

I tried to install it via:

sudo pip install

but the following error was raised. It looks like one of the files doesn't exist; it seems that CHANGELOG was moved to the parent directory!

running install_data
copying LICENSE -> /usr/local/
copying docs/README.html -> /usr/local/
error: can't copy 'docs/CHANGELOG.txt': doesn't exist or not a regular file
----------------------------------------
Cleaning up...

`load` and `save` methods for estimators

Add load and save methods to classifiers and regressors using JSON to avoid common pickle issues / platform incompatibilities -- based on https://github.com/rasbt/python-machine-learning-book/blob/master/code/bonus/scikit-model-to-json.ipynb
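
A rough sketch of what such methods could look like, using only the standard library. TinyLinearModel is a hypothetical stand-in, not an mlxtend class, and the split into 'params' and 'fitted' keys is an assumption made for illustration rather than the format used in the linked notebook.

```python
import json
import os
import tempfile

class TinyLinearModel:
    """Hypothetical stand-in for an estimator with fitted parameters."""

    def __init__(self, eta=0.01):
        self.eta = eta      # hyperparameter set at construction
        self.w_ = None      # fitted weights, set during training

    def save(self, path):
        # Store plain Python types only, so the file is portable text
        state = {'params': {'eta': self.eta}, 'fitted': {'w_': self.w_}}
        with open(path, 'w') as f:
            json.dump(state, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            state = json.load(f)
        model = cls(**state['params'])
        model.w_ = state['fitted']['w_']
        return model

model = TinyLinearModel(eta=0.1)
model.w_ = [0.5, -1.25, 2.0]    # pretend these came from fitting

path = os.path.join(tempfile.mkdtemp(), 'model.json')
model.save(path)
restored = TinyLinearModel.load(path)
print(restored.eta, restored.w_)
```

With real estimators, NumPy arrays would first have to be converted to lists (and back on load), since json cannot serialize them directly.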

Add Documentation for StackingCVClassifier

Adding the documentation for PR #102

adding static type checking via mypy

Add type hints (see PEP484) and static type checking to Travis CI using mypy.

However, the question is whether we want to use the full type-hinting syntax (and maybe lose compatibility with older Python versions) or only type hints via comments, e.g.,

def hello(r: int, c=5) -> str:
   s = 'hello'  # type: str
   return '(%d + %d) times %s' % (r, c, s)

vs.

def hello(r, c=5):
   s = 'hello'  # type: str
   return '(%d + %d) times %s' % (r, c, s)

Any thoughts?

UPDATE:

As Daniel Moisset from machinalis pointed out, a Python 2.x compatible alternative to the first scenario above would be:

def hello(r, c=5):
    # type: (int, int) -> str
    """Some docstring"""
    s = 'hello'
    return '(%d + %d) times %s' % (r, c, s)

or for longer parameter lists:

def hello(r,  # type: int
          c=5):
    # type: (...) -> str
    s = 'hello'
    return '(%d + %d) times %s' % (r, c, s)

which would be the best option for now imho.

Also see his awesome blog post for more info on using mypy: http://www.machinalis.com/blog/a-day-with-mypy-part-1/
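
One practical difference between the two styles is worth keeping in mind: comment-based hints are ordinary comments at runtime, so they never show up in `__annotations__`. A small sketch under Python 3 (the function names hello_annotated and hello_commented are made up for this comparison):

```python
def hello_annotated(r: int, c=5) -> str:
    s = 'hello'
    return '(%d + %d) times %s' % (r, c, s)

def hello_commented(r, c=5):
    # type: (int, int) -> str
    s = 'hello'
    return '(%d + %d) times %s' % (r, c, s)

# Annotation syntax is stored on the function object ...
print(hello_annotated.__annotations__)
# ... while comment-style hints are invisible at runtime
print(hello_commented.__annotations__)

# Both variants behave identically when called
print(hello_annotated(1) == hello_commented(1))
```

So the comment style costs nothing at runtime and keeps the source parseable by older interpreters, but tools that inspect annotations at runtime would not see the hints.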

Add option to drop NaNs in SequentialFeatureSelector

I am trying to use SequentialFeatureSelector on a dataset where the number of available features is on the same order of magnitude as the number of samples. The dataset has lots of missing values (NaNs) that cannot be imputed from other samples: they simply don't make sense in some cases.

I can obviously drop NaNs before feeding anything to the feature selector, but this needlessly reduces the number of available data points, as the missing values don't happen always on the same rows, and I'd never want to fit my model using all columns at the same time.

A way around this problem would be to allow the NA dropping to (optionally) happen within the ColumnSelector.transform. In this way, I'd only be dropping the rows for which there are NAs in the specific columns that are needed for a test. This, however, breaks the sklearn API, as it would require the transform method to modify also the target vector y, so it seems I cannot just add a custom transformer to drop NA's to the base estimator.

An alternative solution could be to hard-code the dropna within the SequentialFeatureSelector._calc_score method, calling it before using cross-validation (find the row indices that contain NAs for the selected columns, then slice X and y by those rows before calling the scoring function). Would this be an acceptable/desirable change? I can put together a quick implementation if you think it is worth it.
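
The second idea could look roughly like the following. This is a plain-Python sketch of the row filtering only, not a patch against _calc_score; drop_nan_rows is a hypothetical helper name, and X is assumed to be a list of rows rather than a NumPy array.

```python
import math

def drop_nan_rows(X, y, indices):
    """Keep only rows that have no NaN in the selected columns."""
    keep = [i for i, row in enumerate(X)
            if not any(math.isnan(row[j]) for j in indices)]
    X_sub = [[X[i][j] for j in indices] for i in keep]
    y_sub = [y[i] for i in keep]
    return X_sub, y_sub

nan = float('nan')
X = [[1.0, nan, 3.0],
     [4.0, 5.0, nan],
     [7.0, 8.0, 9.0]]
y = [0, 1, 0]

# Scoring the subset (0, 1) only loses the first row
X_sub, y_sub = drop_nan_rows(X, y, (0, 1))
print(X_sub, y_sub)

# Scoring column 0 alone keeps every row
X_one, y_one = drop_nan_rows(X, y, (0,))
print(len(X_one))
```

Dropping NaNs up front over all three columns would leave a single row here, which is the data loss described above. One consequence of filtering per subset is that different subsets are scored on different numbers of samples, so their cross-validation scores are not strictly comparable.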

Correcting documentation: online annotations via hypothesis

Hello,

While looking at the feature selection documentation at:

http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/

I have come across several typos. In order to facilitate quick annotations of the text maybe the following tool is useful:

https://hypothes.is/

Anyone can comment, ask questions, suggest alterations and point out typos quickly. In this way the maintainer has access to the location and comment in a single web-page.

HTHs

Formatting on file_io.find_files parameters is off on the documentation website

Documentation: http://rasbt.github.io/mlxtend/user_guide/file_io/find_files/#api

The parameters recursive and check_ext are not formatted as individual parameters but are instead in the same bullet point as path.

Score Appending Issue ?!

Hi,

I just started working with the package and it seems great. However, whenever I try to use Sequential Forward Selection I get:

AttributeError: 'NoneType' object has no attribute 'write'

I have tried different estimators from scikit-learn, and all of scikit-learn's built-in feature selection tools seem to get along fine with the data.
I really don't know what the problem is.

The error in detail is:

feature_selection\sequential_feature_selector.py in fit(self=SequentialFeatureSelector(clone_estimator=True, ...e, scoring='r2',
skip_if_stuck=True), X=array([[-0.09449112, 0.09449112, -0.95103007, .... 0.76772251,
0.76772251, -0.20334318]]), y=array([ 0.75431081, 0.58790606, 0.66942214, 0...3, 0.82269534, 0.79686278,
0.48766136]))
233 'cv_scores': cv_scores,
234 'avg_score': k_score}
235 sdq.append(k_idx)
236
237 if self.print_progress:
--> 238 sys.stderr.write('\rFeatures: %d/%d' % (
k_idx = (538,)
k_to_select = 20
239 len(k_idx), k_to_select))
240 sys.stderr.flush()
241
242 if select_in_range:

AttributeError: 'NoneType' object has no attribute 'write'
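
The traceback points at sys.stderr.write, so the message suggests that sys.stderr itself is None in this environment. Python can start without standard streams in some setups (for example GUI applications on Windows launched via pythonw); whether that applies here is an assumption. Since the write sits behind `if self.print_progress:`, passing print_progress=False should avoid that line. A defensive variant of the progress write could be sketched like this (report_progress is a hypothetical helper, not mlxtend code):

```python
import io
import sys

def report_progress(n_selected, k_to_select, stream=None):
    """Write a progress line if a usable stream exists; otherwise do nothing."""
    if stream is None:
        stream = sys.stderr
    if stream is None:
        # No stderr available: skip the output instead of raising
        return False
    stream.write('\rFeatures: %d/%d' % (n_selected, k_to_select))
    stream.flush()
    return True

# Demonstration with an in-memory stream standing in for stderr
buf = io.StringIO()
wrote = report_progress(1, 20, stream=buf)
print(wrote, repr(buf.getvalue()))
```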
