
py-earth's Introduction

scikit-learn-contrib

scikit-learn-contrib is a GitHub organization for gathering high-quality scikit-learn compatible projects. It also provides a template for establishing new scikit-learn compatible projects.

Vision

With the explosion of the number of machine learning papers, it becomes increasingly difficult for users and researchers to implement and compare algorithms. Even when authors release their software, it takes time to learn how to use it and how to apply it to one's own purposes. The goal of scikit-learn-contrib is to provide easy-to-install and easy-to-use high-quality machine learning software. With scikit-learn-contrib, users can install a project by pip install sklearn-contrib-project-name and immediately try it on their data with the usual fit, predict and transform methods. In addition, projects are compatible with scikit-learn tools such as grid search, pipelines, etc.
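
For concreteness, here is a rough sketch of that workflow using py-earth (the project this page documents) with standard scikit-learn tooling; the toy data, the parameter grid and the scoring choice are made up for illustration and assume a reasonably recent scikit-learn:

import numpy
from pyearth import Earth
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Toy data: y depends piecewise-linearly on one column of X
numpy.random.seed(0)
X = numpy.random.uniform(size=(200, 5))
y = numpy.abs(X[:, 2] - 0.5) + 0.1 * numpy.random.normal(size=200)

# Earth behaves like any other scikit-learn estimator, so it can sit
# inside a Pipeline and be tuned with GridSearchCV.
pipeline = Pipeline([('earth', Earth())])
grid = GridSearchCV(pipeline, {'earth__max_degree': [1, 2]},
                    scoring='neg_mean_squared_error', cv=3)
grid.fit(X, y)
print(grid.best_params_)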

Projects

If you would like to include your own project in scikit-learn-contrib, take a look at the workflow.

A simple but efficient density-based clustering algorithm that can find clusters of arbitrary sizes, shapes and densities in two dimensions. Higher dimensions are first reduced to 2-D using t-SNE. The algorithm relies on a single parameter K, the number of nearest neighbors.

Read The Docs, Read the Paper

Maintained by: Mohamed Abbas

Large-scale linear classification, regression and ranking.

Maintained by Mathieu Blondel and Fabian Pedregosa.

Fast and modular Generalized Linear Models with support for models missing in scikit-learn.

Maintained by Mathurin Massias, Pierre-Antoine Bannier, Quentin Klopfenstein and Quentin Bertrand.

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines.

Maintained by Jason Rudy and Mehdi.

Python module to perform under-sampling and over-sampling with various techniques.

Maintained by Guillaume Lemaitre, Fernando Nogueira, Dayvid Oliveira and Christos Aridas.

Factorization machines and polynomial networks for classification and regression in Python.

Maintained by Vlad Niculae.

Confidence intervals for scikit-learn forest algorithms.

Maintained by Ariel Rokem, Kivan Polimis and Bryna Hazelton.

A high performance implementation of HDBSCAN clustering.

Maintained by Leland McInnes, jc-healy, c-north and Steve Astels.

A library of sklearn compatible categorical variable encoders.

Maintained by Will McGinnis and Paul Westenthanner

Python implementations of the Boruta all-relevant feature selection method.

Maintained by Daniel Homola

Pandas integration with sklearn.

Maintained by Israel Saeta Pérez

Machine learning with logical rules in Python.

Maintained by Florian Gardin, Ronan Gautier, Nicolas Goix and Jean-Matthieu Schertzer.

A Python implementation of the stability selection feature selection algorithm.

Maintained by Thomas Huijskens

Metric learning algorithms in Python.

Maintained by CJ Carey, Yuan Tang, William de Vazelhes, Aurélien Bellet and Nathalie Vauquier.


py-earth's Issues

Examples failing, or Not Compatible with Windows

Both of the usage examples result in the same error for me:

import numpy
import cProfile
from pyearth import Earth
from matplotlib import pyplot

numpy.random.seed(2)
m = 1000
n = 10
X = 80 * numpy.random.uniform(size=(m,n)) - 40
y = numpy.abs(X[:,6] - 4.0) + 1 * numpy.random.normal(size=m)
model = Earth(max_degree = 1)
model.fit(X,y)
Traceback (most recent call last):
File "", line 1, in
File "C:\Python27\lib\site-packages\pyearth\earth.py", line 312, in fit
self.forward_pass(X, y)
File "C:\Python27\lib\site-packages\pyearth\earth.py", line 383, in forward_pass
forward_passer = ForwardPasser(X, y, **args)
File "_forward.pyx", line 67, in pyearth._forward.ForwardPasser.init (pyearth/_forward.c:3146)
File "_forward.pyx", line 96, in pyearth._forward.ForwardPasser.init_linear_variables (pyearth/_forward.c:3698)
ValueError: Buffer dtype mismatch, expected 'INT_t' but got 'long long'

py-earth fails for multi-column regression

py-earth fails when fitting a multi-column [m x n] label matrix against two-dimensional training data. This can be considered an enhancement request.

model.fit(X_train, y_train)
File "/Library/Python/2.7/site-packages/pyearth/earth.py", line 331, in fit
X, y, sample_weight = self._scrub(X, y, sample_weight)
File "/Library/Python/2.7/site-packages/pyearth/earth.py", line 262, in _scrub
y = y.reshape(y.shape[0])
ValueError: total size of new array must be unchanged
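
Until multi-column labels are supported natively, one possible workaround is to fit a separate Earth model per output column, for instance via scikit-learn's MultiOutputRegressor. A rough sketch with made-up data (assumes scikit-learn >= 0.18):

import numpy
from pyearth import Earth
from sklearn.multioutput import MultiOutputRegressor

# Made-up training data with a two-column label matrix
numpy.random.seed(0)
X_train = numpy.random.uniform(size=(100, 4))
y_train = numpy.column_stack([
    numpy.abs(X_train[:, 0] - 0.5),
    X_train[:, 1] ** 2,
])

# One Earth model is fit per output column under the hood
model = MultiOutputRegressor(Earth(max_degree=1))
model.fit(X_train, y_train)
y_pred = model.predict(X_train)   # shape (100, 2)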

Implement upfront data validation/conversion

Currently there is no checking on input data. If you put in data in the wrong format, the first time you know about it might be when you get a segfault or, worse, an incorrect model.
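
As a rough illustration of the kind of upfront check that could be added (the _validate_input helper and its messages are hypothetical, not part of the py-earth API):

import numpy

def _validate_input(X, y):
    # Hypothetical upfront validation: coerce to float arrays and fail
    # early with a clear message instead of segfaulting later.
    X = numpy.asarray(X, dtype=numpy.float64)
    y = numpy.asarray(y, dtype=numpy.float64).ravel()
    if X.ndim != 2:
        raise ValueError('X must be 2-dimensional')
    if X.shape[0] != y.shape[0]:
        raise ValueError('X and y have inconsistent numbers of samples')
    if not (numpy.all(numpy.isfinite(X)) and numpy.all(numpy.isfinite(y))):
        raise ValueError('Input contains NaN or infinity')
    return X, y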

Pass order vectors around instead of reordering data

Currently, the entire data set is reordered for each variable at each iteration. This process is currently the biggest performance bottleneck by far. It would be better if all variable orders were determined upfront and stored, then passed into valid_knots and best_knot. This requires modification to those two methods. It will probably make both methods slower, but hopefully not as slow as reordering the data all the time.
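
A toy illustration of the idea in plain numpy (not the actual Cython implementation):

import numpy

numpy.random.seed(0)
X = numpy.random.uniform(size=(1000, 10))

# Compute one order vector per variable once, upfront
orders = [numpy.argsort(X[:, j]) for j in range(X.shape[1])]

# Later passes walk a variable in sorted order through its order vector
# instead of physically reordering the whole data set.
j = 3
for idx in orders[j]:
    value = X[idx, j]   # visited in nondecreasing order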

Build fails because of missing file

I might be missing something but setup.py includes the examples/vFunctionExample.py which is not in the repository.

'scripts':['examples/vFunctionExample.py'],
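
A likely fix, assuming the script is not actually meant to ship, is to either add examples/vFunctionExample.py to the repository or drop the entry from the setup() call; a sketch of the latter:

from setuptools import setup

setup(
    name='pyearth',
    # ... other arguments as in the existing setup.py ...
    # scripts=['examples/vFunctionExample.py'],  # removed: file is not in the repository
)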

Change the way xlabels are specified

The xlabels argument is currently treated as a hyperparameter. It should probably be treated differently. Issues:

  1. In the Pandas case, the labels come with the data and should therefore be set in fit.
  2. In the Pandas case, column order may be less relevant than column name.

At the very least, I should make it easy to set xlabels in the fit method.
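
A hypothetical sketch of what that could look like with a pandas DataFrame; the fit-time xlabels argument shown in the comments is the proposal being discussed, not necessarily the current API:

import numpy
import pandas
from pyearth import Earth

numpy.random.seed(0)
df = pandas.DataFrame({
    'age': numpy.random.uniform(20, 60, size=50),
    'income': numpy.random.uniform(2e4, 1e5, size=50),
})
y = 0.1 * df['income'] + df['age']
model = Earth(max_degree=1)

# Proposed usage -- labels travel with the data and are picked up at fit time:
# model.fit(df, y, xlabels=list(df.columns))
# or, since a DataFrame already carries its column names:
# model.fit(df, y)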

Pyearth stuck in fitting

Dear Jason,

I just tried pyearth on a couple of fitting examples.

This is the code:
import numpy
from pyearth import Earth
from matplotlib import pyplot

# Create some fake data
numpy.random.seed(seed=0)

N = 100
X = numpy.linspace(-10, 10, N)
numpy.random.shuffle(X)
y = numpy.zeros(X.shape)
sigma = 1
for i in range(y.size):
    r = numpy.random.normal(1, sigma)
    y[i] = numpy.sin(X[i]) + r

# Fit an Earth model
model = Earth()
print('fitting')
model.fit(X, y)
print('done')

print("\n*** SUMMARY ***")
print(model.summary())

# Plot the model
y_hat = model.predict(X)
pyplot.figure()
pyplot.plot(X, y, 'r.')
pyplot.plot(X, y_hat, 'b.')
pyplot.show()

It generates a noisy sine function and tries to fit it. There is an issue I'd like to point out: if the number of samples is not large enough, e.g. N = 100, the algorithm may crash and raise a segmentation fault. Its success depends on the data. If I do not set the random seed, then sometimes it crashes and sometimes it does not.

I hope this information may help you!

Best,

Marco

The coefficients aren't coming out right

Fitting y = abs(x6 - 4) + E gives this result:

Forward Pass

iter parent var knot mse terms gcv rsq grsq

0 - - - 150.123858 1 150.425 0.000 0.000
1 0 6 431 0.911382 3 0.928 0.994 0.994
2 0 9 981 0.903521 5 0.935 0.994 0.994
Stopping Condition: 2

Pruning Pass

iter bf terms mse gcv rsq grsq

0 - 5 0.90 0.935 0.994 0.994
1 4 4 0.91 0.933 0.994 0.994
2 3 3 0.91 0.928 0.994 0.994
3 2 2 140.46 141.875 0.064 0.057
4 1 1 150.12 150.425 0.000 0.000
Selected iteration: 2

Earth Model

(Intercept) -0.022567565678
h(x6-4.0982) -0.0736813881288
h(4.0982-x6) -0.0328571125806
h(x9+38.6904) -0.0066944613208

Traceback in import pyearth when using sklearn.__version__ = "0.16-git"

I saw this issue on Ubuntu 12.04 (LTS Kernel) with a source install of sklearn (version 0.16-git). The setup process completes correctly, but I see the following traceback when importing pyearth:

>>> import pyearth
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyearth/__init__.py", line 8, in <module>
    from .earth import Earth
  File "/usr/local/lib/python2.7/dist-packages/pyearth/earth.py", line 5, in <module>
    from sklearn.utils.validation import assert_all_finite, safe_asarray
ImportError: cannot import name safe_asarray

It looks like sklearn.utils.validation.safe_asarray is an internal sklearn utility and, as mentioned on their website, is not guaranteed to be stable between releases.
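
A hedged compatibility shim along these lines could keep the import working on newer scikit-learn releases; check_array replaced safe_asarray but has different defaults, so treat it as a starting point rather than a drop-in fix:

try:
    from sklearn.utils.validation import safe_asarray
except ImportError:
    # Newer scikit-learn removed safe_asarray; check_array is the closest
    # public replacement, though its defaults differ slightly.
    from sklearn.utils.validation import check_array as safe_asarray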

Optimize reorderxby with BLAS for speed

The reorderxby function in _util is a bottleneck for large sample sizes. Using BLAS for the row copies and swaps might speed it up. Or is there a faster algorithm?

Support non-diagonal weight matrices

This can be accomplished by premultiplying basis functions by the Cholesky square root of the weight matrix (a generalization of how diagonal weight matrices are currently handled). See equation (7) in [1].

[1] Green, P. J. (1984). Iteratively Reweighted Least Squares for Maximum Likelihood Estimation, and some Robust and Resistant Alternatives. Journal of the Royal Statistical Society, Series B (Methodological), 46(2), 149–192.
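
A small numpy sketch of the transformation described above, with toy data and B standing in for the basis function matrix: if W = L L.T is the Cholesky factorization, minimizing (y - B beta).T W (y - B beta) is the same as ordinary least squares on L.T B and L.T y.

import numpy

rng = numpy.random.RandomState(0)
n, k = 50, 3
B = rng.normal(size=(n, k))                 # stand-in for the basis function matrix
y = B.dot(numpy.array([1.0, -2.0, 0.5])) + 0.1 * rng.normal(size=n)

A = rng.normal(size=(n, n))
W = A.dot(A.T) + n * numpy.eye(n)           # some symmetric positive definite weight matrix

L = numpy.linalg.cholesky(W)                # W = L.dot(L.T)
B_w = L.T.dot(B)                            # premultiply the basis functions by the Cholesky factor
y_w = L.T.dot(y)
beta = numpy.linalg.lstsq(B_w, y_w, rcond=None)[0]
print(beta)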

Add PMML support

The ability to read and write models in PMML would do a lot to improve portability.

pip fails to install at same time of numpy

If neither numpy nor pyearth is installed and we install both of them with:

$ pip install -r requirements.txt

where requirements.txt:

numpy
scipy
-e git+https://github.com/jcrudy/py-earth.git#egg=py-earth

pip throws ImportError: No module named numpy while installing pyearth. Refer to the related pip issues to see how pip behaves in this case.
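
A common workaround on the user side (not a fix in py-earth itself) is to make sure numpy is already installed before pip builds py-earth:

$ pip install numpy
$ pip install -r requirements.txt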

installation on windows

Hi,

I find it hard to install this package on Windows.
I have Anaconda, numpy, and Cygwin installed.
The installation stalls with "c:....\gcc.exe failed with exit status 1".

Thanks,
Madhu

Add user specified linear terms

Currently the ForwardPasser automatically decides whether a variable should enter linearly. There should also be a way for users to specify that a variable should enter linearly.
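
A hypothetical interface sketch for this request; the linvars argument below illustrates the proposal and is not a confirmed parameter:

from pyearth import Earth

# Hypothetical: force selected variables to enter the model linearly,
# either by column index or by xlabel.
# model = Earth(max_degree=2, linvars=[0, 3])
# model = Earth(max_degree=2, linvars=['age'])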

Earth.fit can't handle non-continuous X columns

It produces NaNs. I'm guessing this has to do with adjacent knot candidates being exactly equal, which generally only happens with non-continuous distributions, but might also happen with zero-padded data.
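
A hedged sketch of the kind of input that may trigger this, using made-up, zero-padded discrete data (behaviour depends on the py-earth version):

import numpy
from pyearth import Earth

numpy.random.seed(0)
m = 200
X = numpy.zeros((m, 2))
X[:, 0] = numpy.random.uniform(size=m)
X[m // 2:, 1] = numpy.random.randint(0, 3, size=m // 2)   # discrete, zero-padded column
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * numpy.random.normal(size=m)

model = Earth(max_degree=1)
model.fit(X, y)              # may yield NaN coefficients on affected versions
print(model.summary())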
