scikit-learn's Introduction

scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license.

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.

It is currently maintained by a team of volunteers.

Website: https://scikit-learn.org

Installation

Dependencies

scikit-learn requires:

  • Python (>= 3.9)
  • NumPy (>= 1.19.5)
  • SciPy (>= 1.6.0)
  • joblib (>= 1.2.0)
  • threadpoolctl (>= 2.0.0)

Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4. scikit-learn 1.0 and later require Python 3.7 or newer. scikit-learn 1.1 and later require Python 3.8 or newer.

Scikit-learn plotting capabilities (i.e., functions whose names start with plot_ and classes whose names end with Display) require Matplotlib (>= 3.3.4). Running the examples also requires Matplotlib >= 3.3.4. A few examples require scikit-image >= 0.17.2, a few require pandas >= 1.1.5, and some require seaborn >= 0.9.0 and plotly >= 5.14.0.

User installation

If you already have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:

pip install -U scikit-learn

or conda:

conda install -c conda-forge scikit-learn

The documentation includes more detailed installation instructions.

Changelog

See the changelog for a history of notable changes to scikit-learn.

Development

We welcome new contributors of all experience levels. The scikit-learn community goals are to be helpful, welcoming, and effective. The Development Guide has detailed information about contributing code, documentation, tests, and more. We've included some basic information in this README.

Source code

You can check the latest sources with the command:

git clone https://github.com/scikit-learn/scikit-learn.git

Contributing

To learn more about making a contribution to scikit-learn, please see our Contributing guide.

Testing

After installation, you can launch the test suite from outside the source directory (you will need to have pytest >= 7.1.2 installed):

pytest sklearn

See the web page https://scikit-learn.org/dev/developers/contributing.html#testing-and-improving-test-coverage for more information.

Random number generation can be controlled during testing by setting the SKLEARN_SEED environment variable.

Submitting a Pull Request

Before opening a Pull Request, have a look at the full Contributing page to make sure your code complies with our guidelines: https://scikit-learn.org/stable/developers/index.html

Project History

Note: scikit-learn was previously referred to as scikits.learn.

Help and Support

Documentation

Communication

Citation

If you use scikit-learn in a scientific publication, we would appreciate citations: https://scikit-learn.org/stable/about.html#citing-scikit-learn

scikit-learn's People

Contributors

adrinjalali, agramfort, ahojnnes, amueller, arjoly, cmarmo, glemaitre, glouppe, jakevdp, jeremiedbb, jjerphan, jnothman, larsmans, lesteve, lorentzenchr, lucyleeow, mblondel, mechcoder, nellev, nicolashug, ogrisel, pprett, qinhanmin2014, raghavrv, robertlayton, rth, thomasjpfan, tomdlt, vene, weilinear

scikit-learn's Issues

The logistic regression has no narrative documentation

The logistic regression should have an entry under 'Generalized Linear Models' and the intro of the GLM section should point to it to do regression.

Right now it is not clear from the documentation that the scikit even does logistic regression. It should appear in the table of contents.

Bayesian regression

  • Add priors on the mean instead of assuming that the prior means are zero.
  • Add more references in the doc.
  • Modify ARD to use a vector of hyperparameters for the precision instead of a single value.
  • Spelling: defaut -> default (unless you insist on using French). Spelling out ARD (Automatic Relevance Determination) regression in the docstring, as in the comment line, would also be useful.

Thanks to Josef for the remarks.

Macro/micro average precision/recall/f1-score

When n_classes > 2, the precision / recall / f1-score need to be averaged in some way.

Currently the code in precision_recall_fscore_support does:

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

Since true_pos, false_pos and false_neg are arrays of size n_classes, precision and recall are also arrays of the same size. Then to obtain a single average, the weighted sum is taken.

In the literature, the macro-average and micro-average are usually used but as far as I understand the current code does neither one. The macro is the unweighted average of the precision/recall taken separately for each class. Therefore it is an average over classes. The micro average on the contrary is an average over instances: therefore classes which have many instances are given more importance. However, AFAIK it's not the same as taking the weighted average as currently done in the code.

I think the code should be:

micro_avg_precision = true_pos.sum() / (true_pos.sum() + false_pos.sum())
micro_avg_recall = true_pos.sum() / (true_pos.sum() + false_neg.sum())
macro_avg_precision = np.mean(true_pos / (true_pos + false_pos))
macro_avg_recall = np.mean(true_pos / (true_pos + false_neg))

It's easy to fix (add a micro=True|False option) but the tests may be a pain to update :-/
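To make the difference concrete, here is a minimal sketch on made-up per-class counts. It uses plain Python lists rather than NumPy arrays, and it assumes that the "weighted" average in the current code means weighting each class by its number of true instances; that reading is an assumption, not something the issue states.

```python
# Per-class counts for a hypothetical 3-class problem (made-up numbers).
true_pos = [90, 5, 5]
false_pos = [10, 5, 0]
false_neg = [5, 10, 5]


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


# Per-class precision: one value per class.
precision = [safe_div(tp, tp + fp) for tp, fp in zip(true_pos, false_pos)]

# Macro average: unweighted mean over classes.
macro_precision = sum(precision) / len(precision)

# Micro average: pool the counts over all classes, then divide once.
micro_precision = safe_div(sum(true_pos), sum(true_pos) + sum(false_pos))

# Support-weighted average (assumed meaning of "weighted sum"):
# each class is weighted by its number of true instances.
support = [tp + fn for tp, fn in zip(true_pos, false_neg)]
weighted_precision = (
    sum(p * s for p, s in zip(precision, support)) / sum(support)
)

print(macro_precision, micro_precision, weighted_precision)
```

On these counts the three averages all differ, which is the point made above: the support-weighted average is neither the macro nor the micro average.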

Stability of LARS

>>> clf = LassoLARS()
>>> clf.fit([[0, 0], [1, 1]], [0, 1], alpha=0.0).coef_
array([ NaN,  NaN])

kernel object interface

For precomputed kernels, a square matrix is not an efficient way to store the kernel matrix (since the kernel matrix is symmetric).

We should create a kernel object interface instead. Advantages:

  • The object can store the LRU cache
  • This gives a way for the user to handle kernel re-computations
  • Internally the gram matrix can be stored in packed format
  • This will be useful when we create our own kernel-based estimators (some like Ridge already support kernels, they just lack the interface)
  • This nicely handles the kernel computations between test instances and training instances

The object could be numpy-compatible:

kernel = GaussianKernel(X_train, sigma=0.5)
print(kernel[i, j])  # recompute only if not cached
print(kernel.compute(X_test))

Question: shall we create our own LRU object or shall we just bind libsvm's?

Since there is a plan to bind libsvm's cross-validation code, this also means that the cache will be used more efficiently for cross-validation even when kernel="precomputed".
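A rough sketch of what such an object could look like, written in pure Python for illustration. The class body, the cache_size parameter, the n_evaluations counter, and the exp(-d^2 / (2 sigma^2)) parametrization are all assumptions; it uses Python's functools.lru_cache rather than libsvm's cache, and only the GaussianKernel name, indexing, and compute() come from the proposal above.

```python
import math
from functools import lru_cache


class GaussianKernel:
    """Hypothetical kernel object: entries are computed lazily and cached."""

    def __init__(self, X_train, sigma=0.5, cache_size=1024):
        self.X_train = [tuple(x) for x in X_train]
        self.sigma = sigma
        self.n_evaluations = 0  # counts real (non-cached) computations
        # Per-instance LRU cache keyed on the ordered pair (i, j), i <= j.
        self._cached = lru_cache(maxsize=cache_size)(self._compute_entry)

    def _rbf(self, a, b):
        sq_dist = sum((u - v) ** 2 for u, v in zip(a, b))
        return math.exp(-sq_dist / (2.0 * self.sigma ** 2))

    def _compute_entry(self, i, j):
        self.n_evaluations += 1
        return self._rbf(self.X_train[i], self.X_train[j])

    def __getitem__(self, index):
        i, j = index
        # The kernel matrix is symmetric, so only the upper triangle is stored.
        if i > j:
            i, j = j, i
        return self._cached(i, j)

    def compute(self, X_test):
        """Kernel values between test rows and all training rows."""
        return [[self._rbf(x, t) for t in self.X_train] for x in X_test]


kernel = GaussianKernel([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]], sigma=0.5)
print(kernel[0, 1])          # computed on first access
print(kernel[1, 0])          # same entry, served from the cache
print(kernel.n_evaluations)  # only one real evaluation so far
print(kernel.compute([[0.0, 0.0]]))
```

Keying the cache on the ordered pair is one simple way to exploit symmetry; a packed triangular array would serve the same purpose when the full matrix fits in memory.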

(Sorry for opening many tickets lately: I actually intend to help close them when I get more time ;-)

Sort decision function in svm module

I think the decision function should be sorted just as we do for predict_proba, where we sort by class label, as in probas[:, np.argsort(self.label_)].
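As an illustration of the reordering, here is a small pure-Python sketch. The label order and scores are made up, and it assumes one score column per class, which is a simplification (one-vs-one decision values have one column per pair of classes).

```python
# Hypothetical internal class order, and per-sample scores in that order.
label_ = [2, 0, 1]
decision_values = [
    [0.7, -1.2, 0.1],   # scores for classes 2, 0, 1
    [-0.3, 0.9, 0.4],
]

# Pure-Python equivalent of np.argsort(label_): column indices by class label.
order = sorted(range(len(label_)), key=label_.__getitem__)

# Reorder every row so that column k corresponds to the k-th smallest label.
sorted_values = [[row[k] for k in order] for row in decision_values]

print(order)
print(sorted_values)
```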

k-means++ initialization wrong

The k-means center initialization in https://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/cluster/k_means_.py, function k_init(), is not k-means++. The paper authors' implementation (http://www.stanford.edu/~darthur/kMeansppTest.zip, Utils.cpp, chooseSmartCenters()) does the following:

  • One center is chosen randomly.
  • Now repeat numCenters-1 times:
    • Repeat numLocalTries times:
      • Add a point x with probability proportional to the distance squared from x to the closest existing center
    • Add the point chosen above that results in the smallest potential.

scikit-learn's implementation (taken from pybrain, which in turn took it from Yong Sun's blog at http://blogs.sun.com/yongsun/entry/k_means_and_k_means) does this instead:

  • One center is chosen randomly.
  • Now repeat numCenters-1 times:
    • Repeat numSamples times:
      • Add a point x that has not been tried yet
    • Add the point chosen above that results in the smallest potential.

The authors' implementation samples numLocalTries points with D^2 weighting and chooses the best among those (repeated for each of the k-1 centers to find). For all the results in their paper, the authors used numLocalTries==1. (Only in their "Conclusion and future work" section they state that "experiments showed that k-means++ generally performed better if it selected several new centers during each iteration, and then greedily chose the one that decreased \phi as much as possible", and in their code you can see they tried numLocalTries==2+log(k).)

scikit-learn completely omits the sampling step (authors' Utils.cpp, lines 299-305) and instead greedily chooses the center that minimizes the potential. While this is not necessarily bad, it is not k-means++.

The easy fix would be to change the documentation so that it no longer refers to k-means++ (and to find out what this greedy scheme is called in the literature; I assume somebody has described it already). The better fix would be to correct the implementation. I will do the latter (unless I decide I don't need it) and post back here; until then, just take this as a warning.
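For reference, the sampling scheme described above can be sketched as follows. This is a pure-Python illustration, not the authors' code and not a proposed patch; the function name kmeanspp_init and its parameters are made up, with n_local_trials playing the role of numLocalTries.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def kmeanspp_init(points, k, n_local_trials=1, rng=None):
    """Pick k initial centers by D^2-weighted sampling.

    Assumes at least k distinct points; otherwise all weights can become
    zero and random.choices raises ValueError.
    """
    rng = rng or random.Random()
    centers = [rng.choice(points)]
    # Squared distance from each point to its closest chosen center.
    closest = [sq_dist(p, centers[0]) for p in points]

    for _ in range(k - 1):
        best = None
        for _ in range(n_local_trials):
            # Sample a candidate with probability proportional to D^2.
            idx = rng.choices(range(len(points)), weights=closest, k=1)[0]
            new_closest = [min(d, sq_dist(p, points[idx]))
                           for d, p in zip(closest, points)]
            potential = sum(new_closest)
            # Among the sampled candidates, keep the lowest potential.
            if best is None or potential < best[0]:
                best = (potential, idx, new_closest)
        _, idx, closest = best
        centers.append(points[idx])
    return centers


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
centers = kmeanspp_init(points, k=3, rng=random.Random(0))
print(centers)
```

With n_local_trials=1 this is the variant used for the results in the paper; larger values add the greedy selection among sampled candidates. An already chosen point has zero weight, so it cannot be sampled again.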

improvements in naive_bayes

  • Across the whole module, the names of estimated parameters do not end with an underscore.
  • Non-standard names are used for some variables, such as unique_y (called classes in SGD and labels in SVM); we should think about unifying these.

affinity propagation failed for identity matrix

Affinity propagation does not handle an identity matrix correctly:

In [81]: s = np.array([[1, 0], [0, 1]])

In [82]: affinity_propagation(s, verbose=True)
Did not converged
Out[82]:
(None, array([[ nan],
[ nan]]))

When s has a float dtype, affinity propagation converges, so the failure may come from how the numerical computation handles integer input.

BTW, it seems there are two ways to report bugs, sf and github. Do you have any preference?

copy parameter in all transform methods

For consistency and to be able to write an efficient pipeline, all transform methods should honor a copy=True|False parameter.

I'm not in my working environment, so I cannot check right now, but we should make a list in this ticket of the methods that need to be fixed.
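A minimal sketch of the behaviour being asked for, using a made-up MeanCenterer class on lists of lists. The per-call copy override is an assumption about how the parameter could be exposed, not an existing API.

```python
class MeanCenterer:
    """Hypothetical transformer, used only to illustrate the copy flag."""

    def __init__(self, copy=True):
        self.copy = copy

    def fit(self, X):
        n_rows = len(X)
        self.mean_ = [sum(col) / n_rows for col in zip(*X)]
        return self

    def transform(self, X, copy=None):
        # A per-call argument overrides the constructor default.
        copy = self.copy if copy is None else copy
        out = [list(row) for row in X] if copy else X
        for row in out:
            for j, m in enumerate(self.mean_):
                row[j] -= m
        return out


X = [[1.0, 10.0], [3.0, 30.0]]
centerer = MeanCenterer().fit(X)

X_new = centerer.transform(X)                # X is left untouched
X_same = centerer.transform(X, copy=False)   # X is overwritten in place
```

In a pipeline, only the first step would need copy=True to protect the caller's data; later steps could work in place on the intermediate result.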

Bug in k-means centering

Here is a report I received by anonymous private mail:

https://github.com/scikit-learn/scikit-learn/blob/master/scikits/learn/cluster/k_means_.py

Line 176:

175        elif hasattr(init, '__array__'):
176            centers = np.asanyarray(init).copy()
177        elif callable(init):

You take predefined centers as an optional initialization method. You
copy them directly into the kmeans, but you don't account for the fact
that the X data has already been centered on line 167:

167    X -= Xmean

Also, when you return the centers, you make sure to add Xmean back:

208    return best_centers + Xmean, best_labels, best_inertia

This seems like a bug, but I could be wrong in some very subtle way.

The obvious fix for this would be to replace line 176 with this:

centers = np.asanyarray(init).copy() - Xmean
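The effect can be reproduced on a toy example. The sketch below is pure Python and does not use the k_means_ code; the center and nearest helpers and the data are made up to show what happens when uncentered initial centers are compared against centered data.

```python
def center(X):
    """Subtract the column means from X; return centered data and the means."""
    n = len(X)
    mean = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, mean)] for row in X], mean


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, ctr)) for ctr in centers]
    return dists.index(min(dists))


X = [[10.0, 10.0], [11.0, 10.0], [20.0, 20.0], [21.0, 20.0]]
init = [[10.5, 10.0], [20.5, 20.0]]  # user-supplied, original coordinates

Xc, Xmean = center(X)

# Buggy: uncentered init compared against centered data.
buggy_labels = [nearest(x, init) for x in Xc]

# Fixed: shift init into the centered frame first.
init_c = [[v - m for v, m in zip(ctr, Xmean)] for ctr in init]
fixed_labels = [nearest(x, init_c) for x in Xc]

print(buggy_labels)  # every point lands in cluster 0
print(fixed_labels)  # the two groups are separated
```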

improvements for the doc

  • Rename Gallery -> Example Gallery
  • User guide should be more nested
  • h2, h3 should be padded to one side

common file format support

It would be nice to have loaders for common file formats such as libsvm's or weka's.

The loaders should have two modes: batch and online. In the latter case, we could have an iterator that yields X matrices of a given chunk size (suitable for partial_fit).
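The two modes could be sketched as follows for the libsvm/svmlight text format. The function names are made up, rows are kept as sparse dicts instead of matrices, and format extras such as qid tokens are not handled.

```python
def parse_svmlight_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {idx: val})."""
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def iter_chunks(lines, chunk_size):
    """Online mode: yield (X, y) chunks of at most chunk_size rows."""
    X, y = [], []
    for line in lines:
        # Skip blank lines and full-line comments.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        label, features = parse_svmlight_line(line)
        X.append(features)
        y.append(label)
        if len(y) == chunk_size:
            yield X, y
            X, y = [], []
    if y:
        yield X, y


def load_all(lines):
    """Batch mode: read everything into one (X, y) pair."""
    X, y = [], []
    for X_chunk, y_chunk in iter_chunks(lines, chunk_size=1024):
        X.extend(X_chunk)
        y.extend(y_chunk)
    return X, y


data = ["1 1:0.5 3:1.0", "-1 2:2.0", "1 1:1.5 2:0.5 # comment"]
chunks = list(iter_chunks(data, chunk_size=2))
X_all, y_all = load_all(data)
```

Because iter_chunks accepts any iterable of lines, an open file object can be passed directly, so the online mode never holds more than one chunk in memory.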

Problem with Sparse implementation of SVM

I am using scikits.learn on very large sparse data and have problems with the sparse SVM and the 'poly' kernel. I attach a simple test case based on the iris example from the scikits.learn website. It uses the 'linear' and 'poly' kernels with both the dense and the sparse implementations. As the graphs show, the 'linear' kernel gives similar results (sparse vs. dense), but the sparse implementation of 'poly' gives wrong results.

I am using scikits.learn version 0.7.1 and have tested it on both 32-bit and 64-bit Windows, with scipy 0.8 on the win32 platform and scipy 0.9rc3 on the win64 platform.

"""
==================================================
Plot different SVM classifiers in the iris dataset
==================================================

Comparison of different linear SVM classifiers on the iris dataset. It
will plot the decision surface for four different SVM classifiers.

"""
print(__doc__)

import numpy as np
import pylab as pl
from scikits.learn import svm, datasets
import scipy as sp


# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features. We could
                     # avoid this ugly slicing by using a two-dim dataset
Xs = sp.sparse.lil_matrix( X ).tocsr()

Y = iris.target

h=.02 # step size in the mesh

# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
svc     = svm.SVC(kernel='linear').fit(X, Y)
rbf_svc = svm.SVC(kernel='poly').fit(X, Y)
ssvc     = svm.sparse.SVC(kernel='linear').fit(Xs, Y)
srbf_svc = svm.sparse.SVC(kernel='poly').fit(Xs, Y)

# create a mesh to plot in
x_min, x_max = X[:,0].min()-1, X[:,0].max()+1
y_min, y_max = X[:,1].min()-1, X[:,1].max()+1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots
titles = ['SVC with linear kernel',
          'SVC with polynomial (degree 3) kernel',
          'Sparse SVC with linear kernel',
          'Sparse SVC with polynomial (degree 3) kernel']


pl.set_cmap(pl.cm.Paired)

for i, clf in enumerate((svc, rbf_svc, ssvc, srbf_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    pl.subplot(2, 2, i+1)

    Xp = np.c_[xx.ravel(), yy.ravel()]

    if i > 1:
        Xp = sp.sparse.lil_matrix( Xp ).tocsr()

    Z = clf.predict( Xp )

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    pl.set_cmap(pl.cm.Paired)
    pl.contourf(xx, yy, Z)
    pl.axis('tight')

    # Plot also the training points
    pl.scatter(X[:,0], X[:,1], c=Y)

    pl.title(titles[i])

pl.axis('tight')
pl.show()

User Guide Release 0.6, Section 3.2

The following sentence:
SVMs perform classification as a function of some subset of the training data, called the support vectors. These vectors can be accessed in member support_:

seems to be intended to describe how to obtain the support vectors, so "support_" should be "support_vectors_".

The subsequent example is also supposed to be:
>>> clf.support_vectors_
array([[ 0., 0.],
[ 1., 1.]])

prebuilt docs tarball?

Given that building various images in the user guide requires one to download various large datasets, would it be possible to distribute a tarball containing a prebuilt copy of the docs (e.g., in html)? This would be helpful for scikit learn package maintainers for various distributions because it would obviate the need to include large datasets in the source packages in order to build the docs properly.

Crasher in SVR with probability=True

The following code creates a segfault:

from scikits.learn import svm, datasets

diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

svr = svm.SVR(probability=True)
svr.fit(X, y)

Unsurprisingly, the gdb backtrace shows that the segfault occurs in the call to libsvm_train on line 145 in svm/base.py. The first lines of the backtrace are:

Program received signal SIGSEGV, Segmentation fault.
__memcpy_ssse3 () at ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S:1360
1360    ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S: No such file or directory.
    in ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S
(gdb) bt
#0  __memcpy_ssse3 () at ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S:1360
#1  0x015ff843 in copy_probB (data=0x8a594b0 "h\344G", model=0x89e0f68, dims=0x89d1268)
    at /usr/include/bits/string3.h:52
#2  0x016147ac in __pyx_pf_7_libsvm_libsvm_train (__pyx_self=0x0, __pyx_args=
    (, , 3, 2, 3, , , , , , , , , , , 1, 1), __pyx_kwds=0x0)
    at scikits/learn/svm/src/libsvm/_libsvm.c:2111
#3  0x080ddd23 in call_function (f=
    Frame 0x8a43d9c, for file /home/varoquau/dev/scikit-learn/scikits/learn/svm/base.py, line 150, in fit (self=, probability=True, degree=3, shrinking=True, class_weight_label=, p=, impl='epsilon_svr', tol=, cache_size=, coef0=, nu=, gamma=, class_weight=) at remote 0x8a0498c>, X=, y=, class_weight={}, sample_weight=, params={}, kernel_type=2, _X=, solver_type=3), throwflag=0)
    at ../Python/ceval.c:3750

Implement Multinomial Naive Bayes

Multinomial Naive Bayes is a simple algorithm which scales well and has probabilistic output.

use_prior=True/False would be a nice option in the constructor.

In case loops are slow, use Cython.
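A compact pure-Python sketch of the algorithm, to show how little is involved. The class layout, the alpha smoothing parameter, and the attribute names are assumptions made for illustration; only use_prior comes from the request above.

```python
import math
from collections import defaultdict


class MultinomialNB:
    """Minimal multinomial naive Bayes on count vectors (illustrative only)."""

    def __init__(self, alpha=1.0, use_prior=True):
        self.alpha = alpha          # additive (Laplace/Lidstone) smoothing
        self.use_prior = use_prior  # False means a uniform class prior

    def fit(self, X, y):
        n_features = len(X[0])
        counts = defaultdict(lambda: [0.0] * n_features)
        class_count = defaultdict(int)
        for row, label in zip(X, y):
            class_count[label] += 1
            for j, v in enumerate(row):
                counts[label][j] += v
        self.classes_ = sorted(class_count)
        n = len(y)
        self.class_log_prior_ = {
            c: math.log(class_count[c] / n) if self.use_prior
            else -math.log(len(self.classes_))
            for c in self.classes_
        }
        self.feature_log_prob_ = {}
        for c in self.classes_:
            total = sum(counts[c]) + self.alpha * n_features
            self.feature_log_prob_[c] = [
                math.log((v + self.alpha) / total) for v in counts[c]
            ]
        return self

    def predict_log_proba(self, row):
        joint = {
            c: self.class_log_prior_[c]
            + sum(v * lp for v, lp in zip(row, self.feature_log_prob_[c]))
            for c in self.classes_
        }
        # Normalize with log-sum-exp for numerical stability.
        top = max(joint.values())
        log_norm = top + math.log(
            sum(math.exp(v - top) for v in joint.values())
        )
        return {c: v - log_norm for c, v in joint.items()}

    def predict(self, row):
        log_proba = self.predict_log_proba(row)
        return max(log_proba, key=log_proba.get)


X = [[2, 1, 0], [3, 0, 0], [0, 1, 3], [0, 0, 2]]
y = ["spam", "spam", "ham", "ham"]
clf = MultinomialNB().fit(X, y)
print(clf.predict([1, 0, 0]), clf.predict([0, 0, 2]))
```

Training is a single pass of counting, which is why the method scales well; the inner loops over features are the part that would move to Cython if they proved slow.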

New Analyzer objects for text feature extraction

Currently, we have three levels of objects for text feature extraction:

  • Preprocessor
  • Analyzer
  • Vectorizer

In this proposal, we would like to merge Preprocessor into Analyzer and introduce new methods. This should give users more flexibility to support different (natural) languages.

An analyzer should implement 4 methods:

  • preprocess(str) => str
    Preprocessing such as punctuation removal, lower case conversion. Language dependent.
  • tokenize(str) => list
    Split a string into tokens (words or characters). Language dependent.
  • postprocess(list) => list
    From a list of tokens, output a list of n-grams. Language independent (unless further post-processing is needed).
  • analyze(iter) => iter
    Pass a collection of documents (str, file, io) through the whole chain preprocess => tokenize => postprocess. Since an iterator is returned, CountVectorizer should convert it to a list, but HashingCountVectorizer won't have to.

The class hierarchy could look like this:

  • Analyzer (implements postprocess)
    • RomanAnalyzer (implements preprocess)
      • RomanWordAnalyzer (implements tokenize)
      • RomanCharacterAnalyzer (implements tokenize)

Furthermore, we can have an EnglishWordAnalyzer to handle things like stop words removal and more elaborate processing for English syntax.

ChineseWordAnalyzer and JapaneseWordAnalyzer will likely require external dependencies (library, dictionary/probabilistic model). Thus they are out of the scope of the project but we may want to provide them in a gist.
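The proposed hierarchy could be sketched roughly like this. The method bodies, the max_n parameter, and the regular expression are assumptions made for illustration, and this version only accepts documents given as strings (not files or io objects).

```python
import re


class Analyzer:
    """Base class: n-gram postprocessing and the full analysis chain."""

    def __init__(self, max_n=1):
        self.max_n = max_n

    def preprocess(self, text):
        raise NotImplementedError

    def tokenize(self, text):
        raise NotImplementedError

    def postprocess(self, tokens):
        # Emit all n-grams for n = 1 .. max_n, joined with spaces.
        ngrams = []
        for n in range(1, self.max_n + 1):
            for i in range(len(tokens) - n + 1):
                ngrams.append(" ".join(tokens[i:i + n]))
        return ngrams

    def analyze(self, documents):
        # Lazily run each document through the whole chain.
        for doc in documents:
            yield self.postprocess(self.tokenize(self.preprocess(doc)))


class RomanAnalyzer(Analyzer):
    def preprocess(self, text):
        # Lower-case the text and replace punctuation with spaces.
        return re.sub(r"[^\w\s]", " ", text.lower())


class RomanWordAnalyzer(RomanAnalyzer):
    def tokenize(self, text):
        return text.split()


class RomanCharacterAnalyzer(RomanAnalyzer):
    def tokenize(self, text):
        return [ch for ch in text if not ch.isspace()]


analyzer = RomanWordAnalyzer(max_n=2)
result = list(analyzer.analyze(["Hello, world!"]))
print(result)
```

Because analyze is a generator, a hashing-based vectorizer could consume documents one at a time, while a vocabulary-based one would wrap the call in list().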

LARS module broken

  • the parameter names are different in docstring
  • error below

/software/python/nipype0.3/lib/python2.6/site-packages/scikits.learn-0.6_git-py2.6-linux-x86_64.egg/scikits/learn/glm/base.pyc in predict(self, X)
40 """
41 X = np.asanyarray(X)
---> 42 return np.dot(X, self.coef_) + self.intercept_

lfw not working on scipy 0.8.0

On my Windows box I get:

arn\datasets\lfw.py", line 32, in <module>
    from scipy.misc import imread
ImportError: cannot import name imread

command line interface

Not high priority but would be nice to have a command line interface. Some possible features:

  • Input in various formats (libsvm/svmlight's sparse format, arff, raw
    documents, ...)
  • Pipeline (transformers and estimator)
  • Model selection
  • Evaluation on a test set
  • (Plots?)
  • Model persistence (pickle)

Examples:

$ skl fit --format svmlight --model model.pickle preprocessing.Scaling pca.PCA svm.LinearSVC --input training_data.txt

$ skl predict --format svmlight --model model.pp --input test_data.txt --output predictions.txt

When --input is not provided, the input is read from stdin.
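The argument handling for such a tool could be prototyped with argparse. This sketch only parses the command lines shown above; the skl program does not exist, and the "-" convention for stdin/stdout and the steps argument name are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="skl")
    subparsers = parser.add_subparsers(dest="command", required=True)

    fit = subparsers.add_parser("fit", help="fit a pipeline and save it")
    fit.add_argument("steps", nargs="+",
                     help="pipeline steps, e.g. pca.PCA svm.LinearSVC")

    predict = subparsers.add_parser("predict", help="apply a saved model")
    predict.add_argument("--output", default="-",
                         help="where to write predictions ('-' means stdout)")

    # Options shared by both subcommands.
    for sub in (fit, predict):
        sub.add_argument("--format", default="svmlight")
        sub.add_argument("--model", required=True)
        sub.add_argument("--input", default="-",
                         help="input file ('-' means stdin)")
    return parser


args = build_parser().parse_args(
    ["fit", "--format", "svmlight", "--model", "model.pickle",
     "preprocessing.Scaling", "pca.PCA", "svm.LinearSVC",
     "--input", "training_data.txt"]
)
print(args.command, args.steps, args.input)
```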
