
fracridge's People

Contributors

arokem · hahahannes · tal-golan


fracridge's Issues

Efficient leave-one-out cross-validation?

Is there an easy way of combining fractional ridge regression with efficient leave-one-out cross-validation?

scikit-learn has a documented implementation of LOOCV for standard ridge regression:
https://github.com/scikit-learn/scikit-learn/blob/7e1e6d09bcc2eaeba98f7e737aac2ac782f0e5f1/sklearn/linear_model/_ridge.py#L1432

Edit: the linked code should be reviewed in light of this issue:
scikit-learn/scikit-learn#18079
(TL;DR: although the scikit-learn code mentions GCV, an algebraic form of LOOCV is implemented).
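For comparison, here is a sketch of a brute-force LOO baseline that works today, without any algebraic shortcut. Everything in it is illustrative: the synthetic data, the grid of fractions, and the assumption that FracRidgeRegressor accepts a scalar frac and plugs into scikit-learn's cross_val_score.

# Brute-force LOOCV over candidate fractions (O(n) refits per frac),
# alongside scikit-learn's efficient algebraic LOOCV for standard ridge.
# Assumes FracRidgeRegressor accepts a scalar `fracs`; data are synthetic.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import LeaveOneOut, cross_val_score
from fracridge import FracRidgeRegressor

X, y = make_regression(n_samples=50, n_features=10, n_targets=2,
                       noise=5.0, random_state=0)

# Efficient algebraic LOOCV for standard ridge (what the linked code does):
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)

# Brute-force LOO for fracridge; score with MSE, since R^2 is undefined
# on a single held-out sample:
fracs = np.linspace(0.1, 1.0, 10)
scores = [cross_val_score(FracRidgeRegressor(fracs=f), X, y,
                          cv=LeaveOneOut(),
                          scoring='neg_mean_squared_error').mean()
          for f in fracs]
best_frac = fracs[int(np.argmax(scores))]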

_do_svd with many targets (>280) fails with ValueError: operands could not be broadcast together with shapes ...

Thanks for making this implementation publicly available. It looks like a very promising tool!

I wanted to give it a go on an fMRI dataset that has a large number of targets (i.e., voxels).

I'm getting an error when trying to fit more than ~280 targets. Here's the error message I get when trying 1000 targets:

train.shape, stimdesign_conv.shape
# ((284, 1000), (284, 50))

import numpy as np
from fracridge import FracRidgeRegressor

n_alphas = 5
fracs = np.linspace(0, 1, n_alphas)
fr = FracRidgeRegressor(
    fracs,
    fit_intercept=False,
    normalize=False,
)
fr = fr.fit(stimdesign_conv, train)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-56-8a37e1c17735> in <module>
      6     normalize=False,
      7 )
----> 8 fr = fr.fit(stimdesign_conv, train)

~/.conda/envs/thingsmri_env/lib/python3.8/site-packages/fracridge/fracridge.py in fit(self, X, y, sample_weight)
    309         X, y, X_offset, y_offset, X_scale = self._validate_input(
    310             X, y, sample_weight=sample_weight)
--> 311         coef, alpha = fracridge(X, y, fracs=self.fracs, tol=self.tol,
    312                                 jit=self.jit)
    313         self.alpha_ = alpha

~/.conda/envs/thingsmri_env/lib/python3.8/site-packages/fracridge/fracridge.py in fracridge(X, y, fracs, tol, jit)
    149 
    150     # Calculate the rotation of the data
--> 151     selt, v_t, ols_coef = _do_svd(X, y, jit=jit)
    152 
    153     # Set solutions for small eigenvalues to 0 for all targets:

~/.conda/envs/thingsmri_env/lib/python3.8/site-packages/fracridge/fracridge.py in _do_svd(X, y, jit)
     62         ynew = uu.T @ y
     63 
---> 64     ols_coef = (ynew.T / selt).T
     65 
     66     return selt, v_t, ols_coef

ValueError: operands could not be broadcast together with shapes (1000,) (50,) 

I'm guessing it could have something to do with fracridge._do_svd changing its behavior depending on memory demand.

I'm not sure whether this is a bug or whether limiting the number of targets is the only workaround.

Thanks in advance!
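EDIT: in case it helps others who hit this before a fix lands, a hypothetical workaround — assuming the failure depends only on the number of targets — is to fit the targets in chunks below the threshold and concatenate the coefficients. This reuses the variables from the snippet above; the coef_ shape is taken from the fracridge docstring:

# Hypothetical chunked fit: stay under the ~280-target threshold per call
# and stitch the per-chunk coefficients back together along the target axis.
import numpy as np
from fracridge import FracRidgeRegressor

chunk = 256  # below the reported failure point
coefs = []
for start in range(0, train.shape[1], chunk):
    fr = FracRidgeRegressor(fracs, fit_intercept=False, normalize=False)
    fr.fit(stimdesign_conv, train[:, start:start + chunk])
    coefs.append(fr.coef_)  # (n_features, n_fracs, n_chunk_targets), per docs
coef = np.concatenate(coefs, axis=-1)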

Broken link

Hi Ariel - Nice idea.

The README's link to the MATLAB implementation is a 404.

Onwards,
Brian

Add a `jit` kwarg

It looks like numba sometimes has trouble crunching through very large data. It might be worth making the JIT compilation optional.

Something like:

def fracridge(X, y, fracs=None, tol=1e-6, jit=True):
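A minimal sketch of how the flag might dispatch between a numba-compiled kernel and a plain-NumPy fallback. The helper names here are illustrative, not fracridge's actual internals:

# Illustrative optional-JIT dispatch: compile the hot kernel with numba
# when it is installed and requested, otherwise fall back to plain NumPy.
import numpy as np

try:
    from numba import njit
    HAS_NUMBA = True
except ImportError:
    HAS_NUMBA = False

def _kernel_numpy(ynew, selt):
    # Stand-in for the hot loop inside fracridge.
    return (ynew.T / selt).T

_kernel_numba = njit(_kernel_numpy) if HAS_NUMBA else None

def fracridge(X, y, fracs=None, tol=1e-6, jit=True):
    kernel = _kernel_numba if (jit and HAS_NUMBA) else _kernel_numpy
    # ... the rest of the computation would call `kernel` wherever the
    # jit-compiled code runs today ...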

How to reverse the predicted values back to the original scale?

Hello everyone, thank you very much for your work. I am a newcomer who has just started exploring ridge regression. I would like to ask: is it possible to map the predicted values obtained from FRR back to their original scale? In my research, the range of the dependent variable is quite large (e.g., from 5 to 50 points). Ultimately, I want to create a scatter plot of predicted scores against actual scores, and I would like the scale of the predicted values to be consistent with the actual values. I have successfully executed the following code in MATLAB; however, the scale of the prediction scores (i.e., pred) differs from that of the input dependent variable y (i.e., y_train). Do you have any advice on this issue? Thank you so much in advance.

The MATLAB code:

% lambda is a set of optimal lambda values corresponding to each y variable;
% beta_tmp is #features x #lambdas x #behaviors
beta_tmp = fracridge(feat_train, lambda, y_train);

for b = 1:size(beta_tmp, 3)
    beta(:, b) = beta_tmp(:, b, b);
end

%% Predicting
pred = feat_test * beta;  % pred: #subjects x #behaviors
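Not a definitive answer, but one common cause of such a scale mismatch is standardizing y before (or as part of) the fit. If that is what is happening here — an assumption worth checking against your preprocessing — predictions can be mapped back with the training-set mean and standard deviation. A Python sketch, reusing the variable names from the MATLAB snippet:

# Hypothetical rescaling, assuming y was z-scored before fitting; the
# scalar frac of 0.5 is only a placeholder.
import numpy as np
from fracridge import FracRidgeRegressor

y_mean = y_train.mean(axis=0)
y_std = y_train.std(axis=0)

fr = FracRidgeRegressor(fracs=0.5)
fr.fit(feat_train, (y_train - y_mean) / y_std)
pred = fr.predict(feat_test) * y_std + y_mean  # back in the units of y_train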

interp requires sorted inputs

np.interp requires non-decreasing inputs. If the user provides an input such as [0.1, 0.3, 0.2], it may return garbage as output. A solution could be to check, sort, and warn the user if they provide input that doesn't make sense.
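A sketch of that guard: sort the fractions for np.interp, warn, and undo the permutation afterwards. The helper name and the coefficient-axis layout are assumptions:

# Sketch of the proposed guard: sort the fractions before np.interp is
# used, warn the user, and restore their original order in the output.
import warnings
import numpy as np

def _sorted_fracs(fracs):
    fracs = np.asarray(fracs)
    order = np.argsort(fracs)
    if not np.all(order == np.arange(len(fracs))):
        warnings.warn("`fracs` was not sorted; sorting it for np.interp.")
    return fracs[order], order

fracs, order = _sorted_fracs([0.1, 0.3, 0.2])
# After computing coef for the sorted fracs, undo the permutation along
# the fracs axis, e.g. coef[:, np.argsort(order), :].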

potential speed gains with 'f' order for BLAS

I'm wondering whether there is a speedup to be had in Python on the @ operations. From the fracridge docstring:

X : ndarray, shape (n, p)
        Design matrix for regression, with n number of
        observations and p number of model parameters.
y : ndarray, shape (n, b)
        Data, with n number of observations and b number of targets.

In some cases we have more model parameters than observations (e.g., when using betas to predict some variables).

(This insight came from reading this post:)
https://www.benjaminjohnston.com.au/matmul

In these instances, given that scipy.linalg.blas.sgemm is faster with 'f'-ordered than 'c'-ordered arrays, perhaps we would be much faster if the "large" dimension came first.
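A quick way to test the hypothesis — whether 'F' order actually wins will depend on the BLAS build and the shapes involved, so treat this as an experiment, not a conclusion:

# Time an X.T @ X style product with C-ordered vs Fortran-ordered operands.
import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)
X_c = rng.standard_normal((500, 2000))  # more parameters than observations
X_f = np.asfortranarray(X_c)            # same values, Fortran order

t_c = timeit(lambda: X_c.T @ X_c, number=10)
t_f = timeit(lambda: X_f.T @ X_f, number=10)
print(f"C order: {t_c:.3f}s   F order: {t_f:.3f}s")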

scikit-learn version

Hi :)
I was just wondering if there is a specific reason why the scikit-learn version is pinned at 0.23 rather than the current 1.0?

*** ValueError: operands could not be broadcast together with shapes (2888,) (216,)

f = fracridge(np.concatenate(desw), np.concatenate(data2), opt['frac'])

np.concatenate(desw).shape   # (648, 216)
np.concatenate(data2).shape  # (648, 2888)

The error perhaps happens here:

if X.shape[0] > X.shape[1]:

The next three lines result in ynew having shape (2888,):

uu, ss, v_t = svd(X.T @ X)
selt = np.sqrt(ss)
ynew = np.diag(1./selt) @ v_t @ (X.T @ y)

this then breaks the line:

ols_coef = (ynew.T / selt).T
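For reference, a minimal reproduction of just the broadcasting failure, using the shapes reported above:

# ynew comes out 1-D with length n_targets, but selt has length n_params,
# so the elementwise division cannot broadcast.
import numpy as np

ynew = np.ones(2888)  # observed shape of ynew
selt = np.ones(216)   # one singular value per model parameter
ols_coef = (ynew.T / selt).T
# ValueError: operands could not be broadcast together with shapes (2888,) (216,)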

optimization in python

A recent change to the Python implementation removed an optimization. I think it can be reinstated as shown in the attached image.

[attached image: proposed optimization]

FracRidgeRegressor .fit() fails when data has only one column (ValueError: shape-mismatch for sum)

I'm running into trouble when trying to use the .fit() method of FracRidgeRegressor when x has only one column.

import numpy as np
from fracridge import FracRidgeRegressor

fracs_ = np.arange(0., 1., .05)
fr = FracRidgeRegressor(fracs=fracs_, fit_intercept=True, normalize=True)

# x_train is an array of shape (31240, 1)
# y_train is an array of shape (31240, 211339)
fr.fit(x_train, y_train)

# *** ValueError: shape-mismatch for sum

I've tried removing the redundant dimension with x_train.squeeze(), which only produced

*** ValueError: Expected 2D array, got 1D array instead:
array=[ 0.          0.          0.         ... -0.17669481 -0.05394767
 -0.01017657].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Interestingly, I run into the same problem when y has only one target. Using fracridge.fracridge instead of fracridge.FracRidgeRegressor seems to work, though (see the sketch at the end of this report). Likewise, when x or y has more than one column, the .fit() method works fine. Might this be an issue with the implementation, or am I misunderstanding something?

I'm currently trying this version, pulled from GitHub:

import fracridge
fracridge.__version__
'1.3.2.dev364623090'

Cheers!
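EDIT: for anyone else hitting this, a sketch of the workaround mentioned above, via the functional API (which, per this report, handles the single-column case):

# Workaround sketch: call fracridge.fracridge directly. coef should have
# shape (n_features, n_fracs, n_targets) per the docstring.
from fracridge import fracridge

coef, alpha = fracridge(x_train, y_train, fracs=fracs_)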

better documentation

Can we improve fracridge's documentation to make it more user-friendly?

In particular, perhaps we need an example that shows how multiple targets can be handled? plot_diabetes.html shows an example with just one target. Maybe we need an example with multiple targets that shows how the user can do n-fold cross-validation, something like the sketch below?
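Something along these lines, perhaps — the synthetic data and the choice of fractions are illustrative only:

# Sketch of a multi-target example with 5-fold cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from fracridge import FracRidgeRegressor

X, Y = make_regression(n_samples=200, n_features=20, n_targets=5,
                       noise=10.0, random_state=0)
fracs = np.linspace(0.1, 1.0, 10)

for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    fr = FracRidgeRegressor(fracs=fracs)
    fr.fit(X[train_idx], Y[train_idx])
    pred = fr.predict(X[test_idx])  # expected shape: (n_test, n_fracs, n_targets)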

For that matter, the MATLAB side of things probably needs more examples too... hmm...

We currently have this: https://nrdg.github.io/fracridge/
But shouldn't we have something that looks like: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge ?

The docstrings seem more informative than what is available on the web, so maybe we should propagate them?

np.squeeze of the regression coefficients breaks glmsingle

We encountered an error in the Python version of GLMsingle when analyzing a particular dataset. It turns out that the error was caused by the output of fracridge.py. The following line squeezes the resulting coefficient array:

return coef.squeeze(), alphas

There was an edge case with our fMRI data in which the following line
https://github.com/cvnlab/GLMsingle/blob/c9326eefb57a74d49050b2330fd7267b979bc06b/glmsingle/ols/glm_estimatemodel.py#L561
called fracridge with a scalar fracs (a single ridge penalty) and a data2 array of shape (1, 629) (a single voxel x 629 time points). The np.squeeze call reduced the coefficient array from a 1 x 629 matrix to a 629-long vector, which is not what glmsingle was expecting, hence an error thrown later.

The following line in glmsingle assumed that the betas array is either 2D or 3D, when in fact it was 1D:
https://github.com/cvnlab/GLMsingle/blob/d88744be194fdae701764494c663c26e13b5a5ec/glmsingle/ols/glm_estimatemodel.py#L728

In general, I think a squeeze operation is unsafe in this context: the user might run a script in which pp, ff, or bb is sometimes 1. The dimensionality of the coefficient matrix shouldn't change as a function of that condition.

The only case in which an np.squeeze call might be safe is when fracs is a scalar (not a single-element vector). Then the second dimension can be specifically squeezed with coef.squeeze(1).

EDIT: in the pull request below, I introduced a conditional squeeze of both the target and fracs dimensions that does not include the problematic singleton fracs or targets cases.
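For concreteness, a sketch of that conditional squeeze; the axis numbering assumes coef has shape (n_features, n_fracs, n_targets):

# Only drop axes the caller never asked for: the fracs axis when fracs is
# a true scalar, and the targets axis when y was passed as 1-D.
import numpy as np

def _maybe_squeeze(coef, fracs, y):
    if np.isscalar(fracs):
        coef = coef.squeeze(1)   # scalar frac: drop the fracs axis
    if np.asarray(y).ndim == 1:
        coef = coef.squeeze(-1)  # 1-D y: drop the targets axis
    return coef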

python 3.9 incompatibility

It appears that on Python 3.9, fracridge requires scikit-learn 0.23.2... but when I try to install scikit-learn 0.23.2, I get error messages (crazy C-code compilation errors). (Things seem to work on Python 3.8.)

Perhaps fracridge can be updated to be compatible with the latest scikit-learn (0.24.2?).
