
scikit-learn's Introduction


scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license.

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.

It is currently maintained by a team of volunteers.

Website: https://scikit-learn.org

Installation

Dependencies

scikit-learn requires:

  • Python (>= 3.9)
  • NumPy (>= 1.19.5)
  • SciPy (>= 1.6.0)
  • joblib (>= 1.2.0)
  • threadpoolctl (>= 3.1.0)

Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4. scikit-learn 1.0 and later require Python 3.7 or newer. scikit-learn 1.1 and later require Python 3.8 or newer.

Scikit-learn plotting capabilities (i.e., functions starting with plot_ and classes ending with Display) require Matplotlib (>= 3.3.4), as does running the examples. A few examples require scikit-image >= 0.17.2, a few require pandas >= 1.1.5, and some require seaborn >= 0.9.0 and plotly >= 5.14.0.

User installation

If you already have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:

pip install -U scikit-learn

or conda:

conda install -c conda-forge scikit-learn

The documentation includes more detailed installation instructions.

Changelog

See the changelog for a history of notable changes to scikit-learn.

Development

We welcome new contributors of all experience levels. The scikit-learn community goals are to be helpful, welcoming, and effective. The Development Guide has detailed information about contributing code, documentation, tests, and more. We've included some basic information in this README.

Important links

Source code

You can check the latest sources with the command:

git clone https://github.com/scikit-learn/scikit-learn.git

Contributing

To learn more about making a contribution to scikit-learn, please see our Contributing guide.

Testing

After installation, you can launch the test suite from outside the source directory (you will need to have pytest >= 7.1.2 installed):

pytest sklearn

See the web page https://scikit-learn.org/dev/developers/contributing.html#testing-and-improving-test-coverage for more information.

Random number generation can be controlled during testing by setting the SKLEARN_SEED environment variable.
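
For example, to fix the seed for a test run (assuming a POSIX shell):

SKLEARN_SEED=42 pytest sklearn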

Submitting a Pull Request

Before opening a Pull Request, have a look at the full Contributing page to make sure your code complies with our guidelines: https://scikit-learn.org/stable/developers/index.html

Project History

The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the About us page for a list of core contributors.

The project is currently maintained by a team of volunteers.

Note: scikit-learn was previously referred to as scikits.learn.

Help and Support

Documentation

Communication

Citation

If you use scikit-learn in a scientific publication, we would appreciate citations: https://scikit-learn.org/stable/about.html#citing-scikit-learn

scikit-learn's People

Contributors

adam2392, adrinjalali, agramfort, ahojnnes, amueller, arjoly, glemaitre, glouppe, jakevdp, jeremiedbb, jjerphan, jnothman, larsmans, lesteve, lorentzenchr, lucyleeow, mblondel, mechcoder, nellev, nicolashug, ogrisel, pprett, qinhanmin2014, raghavrv, robertlayton, rth, scikit-learn-bot, thomasjpfan, tomdlt, vene


scikit-learn's Issues

Analyzing the performance of different clustering algorithms with increasing dimensions

Aim: To test how the performance of different clustering algorithms on different datasets changes as noise of increasing dimension is added.

To be done: A Jupyter notebook documenting the effect of adding noise of different dimensions to a dataset. Several types of synthetic datasets are generated, Gaussian noise of varying dimension is added to each, and the performance of each clustering algorithm is measured after the noise is added. This is repeated for noise with different variances.

Expected output: Plots comparing the effect of varying noise dimensions on different clustering algorithms for each dataset. In this set of subplots, the variance of the added noise changes along the columns and the dataset changes along the rows.
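
For illustration, a minimal sketch of this kind of experiment, assuming K-Means as the clustering algorithm and the adjusted Rand index as the performance measure (both are illustrative choices, not prescribed above):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X, y = make_blobs(n_samples=500, centers=3, random_state=0)

for n_noise_dims in [0, 10, 100, 1000]:
    for noise_std in [0.5, 1.0, 2.0]:
        # append Gaussian noise dimensions to the signal dimensions
        noise = rng.normal(scale=noise_std, size=(X.shape[0], n_noise_dims))
        X_noisy = np.hstack([X, noise])
        labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_noisy)
        print(n_noise_dims, noise_std, adjusted_rand_score(y, labels))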

Link to the code: https://nbviewer.jupyter.org/github/sree0917/team-forbidden-forest/blob/master/Sree/Clustering%20comparison%20%281%29.ipynb

Create an internal dictionary for BaseDecisionTree

Problem

Right now, we have to override and copy a lot of custom code to make fit and partial_fit work in subclasses of BaseDecisionTree inside sktree.

Possible solution

We should track the kwarg parameters needed to instantiate the:

  • criterion
  • splitter
  • tree

These should be then easily accessible in subclasses. E.g.

class BaseDecisionTree:
    _criterion_kwargs = ['n_samples', ...]
    _splitter_kwargs = ['criterion', 'max_features', ...]
    ...

Then ideally a subclass just has to add these additional kwargs to the __init__ structure and then override the corresponding _criterion/splitter/tree_kwargs.
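
For illustration, a pure-Python sketch of this pattern; the subclass name and helper below (ObliqueDecisionTree, _collect_kwargs) are hypothetical:

class BaseDecisionTree:
    # kwargs needed to instantiate the criterion and splitter;
    # subclasses extend these tuples rather than copying fit() code
    _criterion_kwargs = ("n_samples",)
    _splitter_kwargs = ("criterion", "max_features")

    def _collect_kwargs(self, names, params):
        # pick out only the parameters a given component needs
        return {name: params[name] for name in names}


class ObliqueDecisionTree(BaseDecisionTree):
    # a subclass only adds its extra kwargs on top of the base ones
    _splitter_kwargs = BaseDecisionTree._splitter_kwargs + ("feature_combinations",)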

cc: @PSSF23 from our discussion

Easy way to build a wheel out of Scikit-Learn-Tree?

Describe the workflow you want to enable

First and foremost, we appreciate your NeuroData work on scikit-learn trees; it is brilliant. We made some tree modifications, and the fork compiles on Linux and macOS without much trouble. We built a wheel on macOS (locally) and reused it in our dependencies instead of recompiling the whole scikit-learn fork, which works and speeds up the installation of our project, which depends on an updated Scikit-Learn-Tree. However, in our Docker container, doing the same produces a problem with the generated wheel after installation.

Is there a way to build the wheels we want (macOS and Linux) so that our end users can use both (open source) instead of having to recompile the whole thing?

Cheers

Describe your proposed solution

None

Describe alternatives you've considered, if relevant

Did not find any yet

Additional context

No response

Checking the performance of classifiers in a high-dimensional noise setting

Description

The sklearn example on comparing different classifiers' accuracies does not cover multiple settings for testing various scenarios. There is no concrete example showing when some of these algorithms win and when they lose. One scenario to consider: given a dataset of relatively low dimension, how does the accuracy of the classifiers change as noise dimensions are added?

Noise dimensions are features appended to the dataset that bear no relevance to the original signal dimensions.

Goal
To check the performance of Random Forest, Support Vector Machine, and K-Nearest Neighbours as three different classifiers under the addition of Gaussian noise across three different variance values.

Proposed changes in the form of a PR
I am proposing a new tutorial in the form of a Jupyter notebook containing all the code, from data generation to the computation of accuracies across noise dimensions.
The final figure will contain a plot of the original datasets, adapted from https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html,
and 9 plots of "Accuracy vs. Number of Noise Dimensions" for the 3 datasets and 3 variances of Gaussian noise. The plots will contain the testing accuracies across 50 trials of the experiment.
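
For illustration, a condensed sketch of the proposed experiment, assuming the three classifiers named above and a single dataset and variance for brevity:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

for n_noise_dims in [0, 10, 100]:
    # append irrelevant Gaussian noise dimensions to the signal
    noise = rng.normal(scale=1.0, size=(X.shape[0], n_noise_dims))
    X_noisy = np.hstack([X, noise])
    X_train, X_test, y_train, y_test = train_test_split(X_noisy, y, random_state=0)
    for clf in [RandomForestClassifier(random_state=0), SVC(), KNeighborsClassifier()]:
        score = clf.fit(X_train, y_train).score(X_test, y_test)
        print(n_noise_dims, type(clf).__name__, score)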

Here is a link to the code:
https://github.com/NeuroDataDesign/team-forbidden-forest/blob/master/Sahana/FINAL_PR_classifiers.ipynb

[ENH] Adding binning capabilities to decision trees

Describe the workflow you want to enable

I am part of the @neurodata team. Feature binning has delivered large efficiency gains with little loss of performance in gradient-boosted trees. This feature should not be limited to gradient-boosted trees; it should be available in all decision trees [1].

By adding binning to decision trees, we would enable massive speedups for decision trees that operate on high-dimensional data (in both features and sample sizes). This would be an additional tradeoff that users can make. The intuition behind binning for decision trees is exactly that of gradient-boosted trees.

Describe your proposed solution

We propose introducing binning to the decision tree classifier and regressor.

An initial PR is proposed here: #24 (review).
However, it seems that many of the files were copied, and it is not 100% clear whether that is needed. Perhaps we can explore how to consolidate the _binning.py/pyx files with the current versions under ensemble/_hist_gradient_boosting/*.

Changes to the Cython codebase

TBD

Changes to the Python API

The following two parameters would be added to DecisionTreeClassifier and DecisionTreeRegressor:

hist_binning=False,
max_bins=255

where the default number of bins follows that of histogram gradient boosting.
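
Under this proposal, usage might look like the following; note that hist_binning and max_bins are the proposed parameters from above and do not exist in scikit-learn's API today:

from sklearn.tree import DecisionTreeClassifier

# hypothetical, proposed parameters -- not part of the current sklearn API
clf = DecisionTreeClassifier(hist_binning=True, max_bins=255)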

Additional context

These changes can also trivially be applied to Oblique Trees.

References:
[1] https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree

Segmentation faults during `partial_fit` for `check_fit_score_takes_y` unit-test in scikit-learn

Describe the bug

During the unit tests of RandomForestClassifier and ExtraTreesClassifier, there appears to be a segmentation fault arising from the partial_fit tree builder's build function.

I suspect this has something to do with either:

i) pickling and then accessing something that isn't restored properly, or
ii) accessing memory that isn't allocated.

It would be great if this could be reproduced, but I have failed to do so locally. This suggests to me it could be an edge case, which makes it even more important to fix.

Steps/Code to Reproduce

TBD.

Expected Results

There should be no segmentation fault.

Actual Results

.....................................................Fatal Python error: Segmentation fault

Thread 0x00007000107d4000 (most recent call first):
  File "/usr/local/miniconda/envs/testvenv/lib/python3.11/threading.py", line 320 in wait
  File "/usr/local/miniconda/envs/testvenv/lib/python3.11/site-packages/joblib/externals/loky/backend/queues.py", line 113 in _feed
  File "/usr/local/miniconda/envs/testvenv/lib/python3.11/threading.py", line 975 in run
  File "/usr/local/miniconda/envs/testvenv/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/usr/local/miniconda/envs/testvenv/lib/python3.11/threading.py", line 995 in _bootstrap

Thread 0x000070000f7d1000 (most recent call first):
  File "/usr/local/miniconda/envs/testvenv/lib/python3.11/selectors.py", line 415 in select
  File "/usr/local/miniconda/envs/testvenv/lib/python3.11/multiprocessing/connection.py", line 930 in wait
  File "/usr/local/miniconda/envs/testvenv/lib/python3.11/site-packages/joblib/externals/loky/process_executor.py", line 611 in wait_result_broken_or_wakeup
  File "/usr/local/miniconda/envs/testvenv/lib/python3.11/site-packages/joblib/externals/loky/process_executor.py", line 557 in run
  File "/usr/local/miniconda/envs/testvenv/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
  File "/usr/local/miniconda/envs/testvenv/lib/python3.11/threading.py", line 995 in _bootstrap

Versions

sklearn submodulev3 branch

Build wheels for the fork of scikit-learn here

We want a CI pipeline (and/or local pipeline) that can build wheels for:

  • Windows
  • Linux
  • Mac Intel (x64)
  • Mac M1 (Arm)

and we can attach those wheels to a specific release we make (these are called "nightly wheels" in scipy/pytorch/tensorflow/etc.). Then we can pip install directly from those wheels. Use https://cibuildwheel.readthedocs.io/en/stable/.

Then create a stable release with all of these wheels pip-installable.

Then have sktree rely on these wheels for v0.1

I added more notes on the call; feel free to add them here for documentation @jshinm.
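
For reference, a local invocation might look like this (the exact configuration would live in pyproject.toml or the CI workflow; this is a sketch, not the final setup):

pip install cibuildwheel
cibuildwheel --platform linux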

Spectral Embedding with Asymmetric Matrices / Directed Graphs

Describe the workflow you want to enable

Currently, sklearn.manifold.SpectralEmbedding is restricted to symmetric affinity matrices; if an asymmetric matrix is passed, it is converted through sklearn.utils.validation.check_symmetric into a symmetric matrix. In doing so, however, one loses the underlying asymmetries and the potential directional clusters present in the adjacency matrix of the directed-graph input.

Describe your proposed solution

The algorithms I propose adding use singular value decomposition, as opposed to eigendecomposition, and a modified Laplacian to perform spectral embedding on directed graphs/asymmetric matrices. Specifically, I would like to propose adding adjacency and Laplacian spectral embedding (ASE and LSE, respectively). My thought would be to add a new class, sklearn.manifold.DirectedSpectralEmbedding, which users may call directly, or which is dispatched to when an asymmetric matrix is passed to sklearn.manifold.SpectralEmbedding. As with SpectralEmbedding, users would choose between ASE and LSE through an affinity parameter.

Additional context

These algorithms have been implemented in GraSPy (available as ASE and LSE), taking a graph represented as a dense or sparse matrix as input and returning the appropriate embedding.
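
For reference, a minimal sketch of the SVD-based adjacency spectral embedding (ASE) step, assuming scipy's truncated SVD; DirectedSpectralEmbedding above is a proposed, not an existing, class:

import numpy as np
from scipy.sparse.linalg import svds

def adjacency_spectral_embedding(A, n_components=2):
    """Embed a (possibly asymmetric) adjacency matrix via truncated SVD."""
    U, s, Vt = svds(A.astype(float), k=n_components)
    # scale the singular vectors by the square roots of the singular values;
    # for directed graphs the "out" and "in" embeddings differ
    X_out = U * np.sqrt(s)
    X_in = Vt.T * np.sqrt(s)
    return X_out, X_in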

Analyzing effect of dimensionality reduction on accuracy of different classifiers on different types of datasets

This will be a document showing the effect of dimensionality reduction on the accuracy of different classifiers.
The document will contain simulations on high-dimensional datasets of different shapes.
Each dataset is synthesized from sklearn's established datasets. Each dataset has 1000 dimensions, of which only 2 carry signal and the rest are noise dimensions.

Questions to answer:

  1. How dimensionality reduction helps classification for different classifiers.
  2. How classifiers perform with different numbers of reduced dimensions retained from the main high-dimensional dataset.

Pipeline to be followed (a sketch is given after this list):

  1. Define a dataset using sklearn's established synthetic datasets with high dimensions.
  2. Perform classification on the data and measure accuracy to quantify the process.
  3. Perform the dimensionality reduction technique, keeping varying numbers of reduced dimensions.
  4. Check the classification performance again after reducing the dimensions at each iteration.
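
For illustration, a minimal sketch of this pipeline, assuming PCA as the dimensionality reduction technique and a random forest as one of the classifiers (both illustrative choices):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 1000 total dimensions, only 2 informative, the rest noise
X, y = make_classification(
    n_samples=500, n_features=1000, n_informative=2,
    n_redundant=0, random_state=0,
)

baseline = cross_val_score(RandomForestClassifier(random_state=0), X, y).mean()
print("no reduction:", baseline)

for n_components in [2, 10, 50]:
    X_red = PCA(n_components=n_components, random_state=0).fit_transform(X)
    score = cross_val_score(RandomForestClassifier(random_state=0), X_red, y).mean()
    print(n_components, score)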

The output of the PR would be a figure showing the different datasets, comparing the accuracies of different classifiers with and without dimensionality reduction, and a plot showing how accuracy varies over the reduced dimensions.

Experiments to follow:
https://github.com/NeuroDataDesign/team-forbidden-forest/blob/master/Parimal%20Joshi/Final_pr_2.ipynb

MAINT Refactor `partial_fit` Cython code to be more maintainable

Some segfaults arose in scikit-tree during the implementation of this PR: neurodata/treeple#249.
Some actionable items came to mind to improve what we have:

Documentation:

cpdef initialize_node_queue(
    self,
    Tree tree,
    object X,
    const float64_t[:, ::1] y,
    const float64_t[:] sample_weight=None,
    const unsigned char[::1] missing_values_in_feature_mask=None,
):
    """Initialize a list of roots"""
    X, y, sample_weight = self._check_input(X, y, sample_weight)

    # organize samples by decision paths
    paths = tree.decision_path(X)
    cdef intp_t PARENT
    cdef intp_t CHILD
    cdef intp_t i
    false_roots = {}
    X_copy = {}
    y_copy = {}
    for i in range(X.shape[0]):
        # collect depths from the node paths
        depth_i = paths[i].indices.shape[0] - 1
        PARENT = depth_i - 1
        CHILD = depth_i

        # find leaf node's & their parent node's IDs
        if PARENT < 0:
            parent_i = 0
        else:
            parent_i = paths[i].indices[PARENT]
        child_i = paths[i].indices[CHILD]
        left = 0
        if tree.children_left[parent_i] == child_i:
            left = 1  # leaf node is left child

        # organize samples by the leaf they fall into (false root);
        # leaf nodes are marked by parent node and
        # their relative position (left or right child)
        if (parent_i, left) in false_roots:
            false_roots[(parent_i, left)][0] += 1
            X_copy[(parent_i, left)].append(X[i])
            y_copy[(parent_i, left)].append(y[i])
        else:
            false_roots[(parent_i, left)] = [1, depth_i]
            X_copy[(parent_i, left)] = [X[i]]
            y_copy[(parent_i, left)] = [y[i]]

    X_list = []
    y_list = []

    # reorder the samples according to parent node IDs
    for key, value in reversed(sorted(X_copy.items())):
        X_list = X_list + value
        y_list = y_list + y_copy[key]
    cdef object X_new = np.array(X_list)
    cdef cnp.ndarray y_new = np.array(y_list)

    # initialize the splitter using sorted samples
    cdef Splitter splitter = self.splitter
    splitter.init(X_new, y_new, sample_weight, missing_values_in_feature_mask)

    # convert the dict to a numpy array and store it
    self.initial_roots = np.array(list(false_roots.items()))
This should ideally be rewritten, or at least commented more thoroughly. Right now it is hard to parse what comprises initial_roots. Since this is part of the Cython codebase, it is a critical piece: segfaults are time-consuming and difficult to chase down.

Next, we probably want to include a clear description for developers of the differences here:

if initial_roots is None:
    # Recursive partition (without actual recursion)
    splitter.init(X, y, sample_weight, missing_values_in_feature_mask)

    if tree.max_depth <= 10:
        init_capacity = <intp_t> (2 ** (tree.max_depth + 1)) - 1
    else:
        init_capacity = 2047

    tree._resize(init_capacity)
    first = 1
else:
    # convert the numpy array back to a dict
    false_roots = {}
    for key_value_pair in initial_roots:
        false_roots[tuple(key_value_pair[0])] = key_value_pair[1]

    # reset the root array
    self.initial_roots = None

Features, or documentation

Part of sklearn handles monotonic constraints and n_constant_features tracking. At first glance, it is not clear whether these are actually tracked across calls. That is, are the monotonic constraints and n_constant_features after a fit followed by a partial_fit on two subsets of the data different from what they would be after a single fit on the entire dataset? If they differ, what does that imply?

In an ideal world, the state is the same.
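
For illustration, a sketch of the consistency check this suggests, assuming the fork's DecisionTreeClassifier exposes partial_fit (upstream scikit-learn's does not):

import numpy as np
from numpy.testing import assert_array_equal

def check_partial_fit_equivalence(make_tree, X, y):
    """Compare one fit on all data against fit + partial_fit on two halves."""
    n = X.shape[0] // 2
    full = make_tree().fit(X, y)
    incremental = make_tree().fit(X[:n], y[:n])
    incremental.partial_fit(X[n:], y[n:])
    # in an ideal world the learned structures agree
    assert_array_equal(full.tree_.feature, incremental.tree_.feature)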
