
py-earth's Introduction

scikit-learn-contrib

scikit-learn-contrib is a GitHub organization for gathering high-quality scikit-learn compatible projects. It also provides a template for establishing new scikit-learn compatible projects.

Vision

With the explosion of the number of machine learning papers, it becomes increasingly difficult for users and researchers to implement and compare algorithms. Even when authors release their software, it takes time to learn how to use it and how to apply it to one's own purposes. The goal of scikit-learn-contrib is to provide easy-to-install and easy-to-use high-quality machine learning software. With scikit-learn-contrib, users can install a project by pip install sklearn-contrib-project-name and immediately try it on their data with the usual fit, predict and transform methods. In addition, projects are compatible with scikit-learn tools such as grid search, pipelines, etc.
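
For concreteness, here is a rough sketch of that workflow using py-earth (the project this page documents) with standard scikit-learn tooling; the toy data, the parameter grid and the scoring choice are made up for illustration and assume a reasonably recent scikit-learn:

import numpy
from pyearth import Earth
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Toy data: y depends piecewise-linearly on one column of X
numpy.random.seed(0)
X = numpy.random.uniform(size=(200, 5))
y = numpy.abs(X[:, 2] - 0.5) + 0.1 * numpy.random.normal(size=200)

# Earth behaves like any other scikit-learn estimator, so it can sit
# inside a Pipeline and be tuned with GridSearchCV.
pipeline = Pipeline([('earth', Earth())])
grid = GridSearchCV(pipeline, {'earth__max_degree': [1, 2]},
                    scoring='neg_mean_squared_error', cv=3)
grid.fit(X, y)
print(grid.best_params_)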

Projects

If you would like to include your own project in scikit-learn-contrib, take a look at the workflow.

A simple but efficient density-based clustering algorithm that can find clusters of arbitrary sizes, shapes and densities in two dimensions. Higher dimensions are first reduced to 2-D using t-SNE. The algorithm relies on a single parameter K, the number of nearest neighbors.

Read The Docs, Read the Paper

Maintained by: Mohamed Abbas

Large-scale linear classification, regression and ranking.

Maintained by Mathieu Blondel and Fabian Pedregosa.

Fast and modular Generalized Linear Models with support for models missing in scikit-learn.

Maintained by Mathurin Massias, Pierre-Antoine Bannier, Quentin Klopfenstein and Quentin Bertrand.

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines.

Maintained by Jason Rudy and Mehdi.

Python module to perform under-sampling and over-sampling with various techniques.

Maintained by Guillaume Lemaitre, Fernando Nogueira, Dayvid Oliveira and Christos Aridas.

Factorization machines and polynomial networks for classification and regression in Python.

Maintained by Vlad Niculae.

Confidence intervals for scikit-learn forest algorithms.

Maintained by Ariel Rokem, Kivan Polimis and Bryna Hazelton.

A high performance implementation of HDBSCAN clustering.

Maintained by Leland McInnes, jc-healy, c-north and Steve Astels.

A library of sklearn compatible categorical variable encoders.

Maintained by Will McGinnis and Paul Westenthanner

Python implementations of the Boruta all-relevant feature selection method.

Maintained by Daniel Homola

Pandas integration with sklearn.

Maintained by Israel Saeta Pérez

Machine learning with logical rules in Python.

Maintained by Florian Gardin, Ronan Gautier, Nicolas Goix and Jean-Matthieu Schertzer.

A Python implementation of the stability selection feature selection algorithm.

Maintained by Thomas Huijskens

Metric learning algorithms in Python.

Maintained by CJ Carey, Yuan Tang, William de Vazelhes, Aurélien Bellet and Nathalie Vauquier.


py-earth's Issues

Examples failing, or Not Compatible with Windows

Both of the usage examples result in the same error for me:

import numpy
import cProfile
from pyearth import Earth
from matplotlib import pyplot

numpy.random.seed(2)
m = 1000
n = 10
X = 80 * numpy.random.uniform(size=(m,n)) - 40
y = numpy.abs(X[:,6] - 4.0) + 1 * numpy.random.normal(size=m)
model = Earth(max_degree = 1)
model.fit(X,y)
Traceback (most recent call last):
File "", line 1, in
File "C:\Python27\lib\site-packages\pyearth\earth.py", line 312, in fit
self.forward_pass(X, y)
File "C:\Python27\lib\site-packages\pyearth\earth.py", line 383, in forward_pass
forward_passer = ForwardPasser(X, y, **args)
File "_forward.pyx", line 67, in pyearth._forward.ForwardPasser.init (pyearth/_forward.c:3146)
File "_forward.pyx", line 96, in pyearth._forward.ForwardPasser.init_linear_variables (pyearth/_forward.c:3698)
ValueError: Buffer dtype mismatch, expected 'INT_t' but got 'long long'

py-earth fails for multi-column regression

py-earth fails when fitting a multi-column [m x n] label matrix against two-dimensional training data. This can be considered an enhancement request.

model.fit(X_train, y_train)
File "/Library/Python/2.7/site-packages/pyearth/earth.py", line 331, in fit
X, y, sample_weight = self._scrub(X, y, sample_weight)
File "/Library/Python/2.7/site-packages/pyearth/earth.py", line 262, in _scrub
y = y.reshape(y.shape[0])
ValueError: total size of new array must be unchanged
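
Until multi-column labels are supported natively, one possible workaround is to fit a separate Earth model per output column, for instance via scikit-learn's MultiOutputRegressor. A rough sketch with made-up data (assumes scikit-learn >= 0.18):

import numpy
from pyearth import Earth
from sklearn.multioutput import MultiOutputRegressor

# Made-up training data with a two-column label matrix
numpy.random.seed(0)
X_train = numpy.random.uniform(size=(100, 4))
y_train = numpy.column_stack([
    numpy.abs(X_train[:, 0] - 0.5),
    X_train[:, 1] ** 2,
])

# One Earth model is fit per output column under the hood
model = MultiOutputRegressor(Earth(max_degree=1))
model.fit(X_train, y_train)
y_pred = model.predict(X_train)   # shape (100, 2)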

Implement upfront data validation/conversion

Currently there is no checking on input data. If you put in data in the wrong format, the first time you know about it might be when you get a segfault or, worse, an incorrect model.
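
As a rough illustration of the kind of upfront check that could be added (the _validate_input helper and its messages are hypothetical, not part of the py-earth API):

import numpy

def _validate_input(X, y):
    # Hypothetical upfront validation: coerce to float arrays and fail
    # early with a clear message instead of segfaulting later.
    X = numpy.asarray(X, dtype=numpy.float64)
    y = numpy.asarray(y, dtype=numpy.float64).ravel()
    if X.ndim != 2:
        raise ValueError('X must be 2-dimensional')
    if X.shape[0] != y.shape[0]:
        raise ValueError('X and y have inconsistent numbers of samples')
    if not (numpy.all(numpy.isfinite(X)) and numpy.all(numpy.isfinite(y))):
        raise ValueError('Input contains NaN or infinity')
    return X, y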

Pass order vectors around instead of reordering data

Currently, the entire data set is reordered for each variable at each iteration. This process is currently the biggest performance bottleneck by far. It would be better if all variable orders were determined upfront and stored, then passed into valid_knots and best_knot. This requires modification to those two methods. It will probably make both methods slower, but hopefully not as slow as reordering the data all the time.
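
A toy illustration of the idea in plain numpy (not the actual Cython implementation):

import numpy

numpy.random.seed(0)
X = numpy.random.uniform(size=(1000, 10))

# Compute one order vector per variable once, upfront
orders = [numpy.argsort(X[:, j]) for j in range(X.shape[1])]

# Later passes walk a variable in sorted order through its order vector
# instead of physically reordering the whole data set.
j = 3
for idx in orders[j]:
    value = X[idx, j]   # visited in nondecreasing order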

Build fails because of missing file

I might be missing something but setup.py includes the examples/vFunctionExample.py which is not in the repository.

'scripts':['examples/vFunctionExample.py'],
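
A likely fix, assuming the script is not actually meant to ship, is to either add examples/vFunctionExample.py to the repository or drop the entry from the setup() call; a sketch of the latter:

from setuptools import setup

setup(
    name='pyearth',
    # ... other arguments as in the existing setup.py ...
    # scripts=['examples/vFunctionExample.py'],  # removed: file is not in the repository
)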

Change the way xlabels are specified

The xlabels argument is currently treated as a hyperparameter. It should probably be treated differently. Issues:

  1. In the Pandas case, the labels come with the data and should therefore be set in fit.
  2. In the Pandas case, column order may be less relevant than column name.

At the very least, I should make it easy to set xlabels in the fit method.
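
A hypothetical sketch of what that could look like with a pandas DataFrame; the fit-time xlabels argument shown in the comments is the proposal being discussed, not necessarily the current API:

import numpy
import pandas
from pyearth import Earth

numpy.random.seed(0)
df = pandas.DataFrame({
    'age': numpy.random.uniform(20, 60, size=50),
    'income': numpy.random.uniform(2e4, 1e5, size=50),
})
y = 0.1 * df['income'] + df['age']
model = Earth(max_degree=1)

# Proposed usage -- labels travel with the data and are picked up at fit time:
# model.fit(df, y, xlabels=list(df.columns))
# or, since a DataFrame already carries its column names:
# model.fit(df, y)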

Pyearth stuck in fitting

Dear Jason,

I just tried pyearth on a couple of fitting examples.

This is the code:
import numpy
from pyearth import Earth
from matplotlib import pyplot

# Create some fake data
numpy.random.seed(seed=0)

N = 100
X = numpy.linspace(-10, 10, N)
numpy.random.shuffle(X)
y = numpy.zeros(X.shape)
sigma = 1
for i in range(y.size):
    r = numpy.random.normal(1, sigma)
    y[i] = numpy.sin(X[i]) + r

# Fit an Earth model
model = Earth()
print('fitting')
model.fit(X, y)
print('done')

print("\n*** SUMMARY ***")
print(model.summary())

# Plot the model
y_hat = model.predict(X)
pyplot.figure()
pyplot.plot(X, y, 'r.')
pyplot.plot(X, y_hat, 'b.')
pyplot.show()

It generates a noisy sine function and tries to fit it. There is an issue I'd like to point out: if the number of samples is not large enough, e.g. N = 100, the algorithm may crash and raise a segmentation fault. Its success depends on the data. If I do not set the random seed, then sometimes it crashes and sometimes it does not.

I hope this information may help you!

Best,

Marco

The coefficients aren't coming out right

Fitting y = abs(x6 - 4) + E gives this result:

Forward Pass

iter parent var knot mse terms gcv rsq grsq

0 - - - 150.123858 1 150.425 0.000 0.000
1 0 6 431 0.911382 3 0.928 0.994 0.994
2 0 9 981 0.903521 5 0.935 0.994 0.994
Stopping Condition: 2

Pruning Pass

iter bf terms mse gcv rsq grsq

0 - 5 0.90 0.935 0.994 0.994
1 4 4 0.91 0.933 0.994 0.994
2 3 3 0.91 0.928 0.994 0.994
3 2 2 140.46 141.875 0.064 0.057
4 1 1 150.12 150.425 0.000 0.000
Selected iteration: 2

Earth Model

(Intercept) -0.022567565678
h(x6-4.0982) -0.0736813881288
h(4.0982-x6) -0.0328571125806
h(x9+38.6904) -0.0066944613208

Traceback in import pyearth when using sklearn.__version__ = "0.16-git"

I saw this issue on Ubuntu 12.04 (LTS Kernel) with a source install of sklearn (version 0.16-git). The setup process completes correctly, but I see the following traceback when importing pyearth:

>>> import pyearth
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyearth/__init__.py", line 8, in <module>
    from .earth import Earth
  File "/usr/local/lib/python2.7/dist-packages/pyearth/earth.py", line 5, in <module>
    from sklearn.utils.validation import assert_all_finite, safe_asarray
ImportError: cannot import name safe_asarray

It looks like sklearn.utils.validation.safe_asarray is an internal sklearn utility and, as mentioned on their website, is not guaranteed to be stable between releases.
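
A hedged compatibility shim along these lines could keep the import working on newer scikit-learn releases; check_array replaced safe_asarray but has different defaults, so treat it as a starting point rather than a drop-in fix:

try:
    from sklearn.utils.validation import safe_asarray
except ImportError:
    # Newer scikit-learn removed safe_asarray; check_array is the closest
    # public replacement, though its defaults differ slightly.
    from sklearn.utils.validation import check_array as safe_asarray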

Optimize reorderxby with BLAS for speed

The reorderxby function in _util is a bottleneck for large sample sizes. Using BLAS for the row copies and swaps might speed it up. Or is there a faster algorithm?

Support non-diagonal weight matrices

This can be accomplished by premultiplying basis functions by the Cholesky square root of the weight matrix (a generalization of how diagonal weight matrices are currently handled). See equation (7) in [1].

[1] Green, P. J. (1984). Iteratively Reweighted Least Squares for Maximum Likelihood Estimation, and some Robust and Resistant Alternatives. Journal of the Royal Statistical Society, Series B (Methodological), 46(2), 149–192.
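
A small numpy sketch of the transformation described above, with toy data and B standing in for the basis function matrix: if W = L L.T is the Cholesky factorization, minimizing (y - B beta).T W (y - B beta) is the same as ordinary least squares on L.T B and L.T y.

import numpy

rng = numpy.random.RandomState(0)
n, k = 50, 3
B = rng.normal(size=(n, k))                 # stand-in for the basis function matrix
y = B.dot(numpy.array([1.0, -2.0, 0.5])) + 0.1 * rng.normal(size=n)

A = rng.normal(size=(n, n))
W = A.dot(A.T) + n * numpy.eye(n)           # some symmetric positive definite weight matrix

L = numpy.linalg.cholesky(W)                # W = L.dot(L.T)
B_w = L.T.dot(B)                            # premultiply the basis functions by the Cholesky factor
y_w = L.T.dot(y)
beta = numpy.linalg.lstsq(B_w, y_w, rcond=None)[0]
print(beta)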

Add PMML support

The ability to read and write models in PMML would do a lot to improve portability.

pip fails to install at same time of numpy

If neither numpy nor pyearth is installed and we install both of them with:

$ pip install -r requirements.txt

where requirements.txt:

numpy
scipy
-e git+https://github.com/jcrudy/py-earth.git#egg=py-earth

pip throws ImportError: No module named numpy while installing pyearth. Refer to the related pip issues to see how pip behaves in this case.
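
A common workaround on the user side (not a fix in py-earth itself) is to make sure numpy is already installed before pip builds py-earth:

$ pip install numpy
$ pip install -r requirements.txt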

installation on windows

Hi,

I find it hard to install this package on Windows.
I have Anaconda, numpy, and Cygwin installed.
The installation stalls with "c:....\gcc.exe failed with exit status 1".

Thanks,
Madhu

Add user specified linear terms

Currently the ForwardPasser automatically decides whether a variable should enter linearly. There should also be a way for users to specify that a variable should enter linearly.
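
A hypothetical interface sketch for this request; the linvars argument below illustrates the proposal and is not a confirmed parameter:

from pyearth import Earth

# Hypothetical: force selected variables to enter the model linearly,
# either by column index or by xlabel.
# model = Earth(max_degree=2, linvars=[0, 3])
# model = Earth(max_degree=2, linvars=['age'])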

Earth.fit can't handle non-continuous X columns

It produces NaNs. I'm guessing this has to do with adjacent knot candidates being exactly equal, which generally only happens with non-continuous distributions, but might also happen with zero-padded data.
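
A hedged sketch of the kind of input that may trigger this, using made-up, zero-padded discrete data (behaviour depends on the py-earth version):

import numpy
from pyearth import Earth

numpy.random.seed(0)
m = 200
X = numpy.zeros((m, 2))
X[:, 0] = numpy.random.uniform(size=m)
X[m // 2:, 1] = numpy.random.randint(0, 3, size=m // 2)   # discrete, zero-padded column
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * numpy.random.normal(size=m)

model = Earth(max_degree=1)
model.fit(X, y)              # may yield NaN coefficients on affected versions
print(model.summary())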
