nrdg / fracridge
Fractional ridge regression
Home Page: https://nrdg.github.io/fracridge
License: BSD 2-Clause "Simplified" License
f = fracridge(np.concatenate(desw), np.concatenate(data2), opt['frac'])
np.concatenate(desw).shape
(648, 216)
np.concatenate(data2).shape
(648, 2888)
Perhaps the error happens here?
fracridge/fracridge/fracridge.py
Line 51 in 46c3090
The next three lines result in ynew being of shape (2888,)
uu, ss, v_t = svd(X.T @ X)
selt = np.sqrt(ss)
ynew = np.diag(1./selt) @ v_t @ (X.T @ y)
This then breaks the line:
ols_coef = (ynew.T / selt).T
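For reference, here is a minimal NumPy sketch of the rotation these lines perform, using random data. It is not the library's code: the variable names only mirror the snippet above, `np.linalg.svd` stands in for whatever `svd` the module imports, and the number of targets is reduced to keep the example light. With a 2D `y`, `ynew` should come out with shape (p, b), so dividing by `selt` broadcasts cleanly and the back-rotated coefficients match an ordinary least-squares solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, b = 648, 216, 50  # b reduced from 2888 to keep the example light
X = rng.standard_normal((n, p))
y = rng.standard_normal((n, b))

# Rotation via the Gram matrix, as in the lines quoted above
uu, ss, v_t = np.linalg.svd(X.T @ X)
selt = np.sqrt(ss)
ynew = np.diag(1.0 / selt) @ v_t @ (X.T @ y)

# Rescale each rotated row by its singular value
ols_coef = (ynew.T / selt).T

# Rotate back to parameter space and compare with a reference OLS solve
coef = v_t.T @ ols_coef
coef_ref = np.linalg.lstsq(X, y, rcond=None)[0]
print(ynew.shape, ols_coef.shape, np.allclose(coef, coef_ref))
```

If `ynew` ends up one-dimensional with the number of targets as its length, the input shapes reaching this code path are probably not what the sketch assumes.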
Hello everyone, thank you very much for your work. I am a newcomer who has just started exploring ridge regression. I would like to ask if it's possible to reverse the predicted values obtained from FRR back to their original scale? In my research, the range of the dependent variable is quite large (e.g., from 5 to 50 points). Ultimately, I want to create a scatter plot of predicted scores against actual scores, and I hope that the scale of the predicted values is consistent with the actual values. I have successfully executed the following code using Matlab; however, I have observed that the scale of the prediction scores (i.e., "pred") differs from that of the input-dependent variable "y" (i.e., "y_train"). Do you have any assistance or advice to offer regarding this issue? Thank you so much in advance.
The MATLAB code:
beta_tmp= fracridge(feat_train,lambda,y_train); %lambda is a set of optimal lambda values corresponding to each y variable; beta_tmp is a #feature x #lambda x #behaviors
for b = 1:size(beta_tmp,3)
beta(:,b) = beta_tmp(:,b,b);
end
%% Predicting
pred = feat_test * beta; %pred: #sub x #behaviors
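One possible cause of such a scale mismatch, assuming the model is fit without an intercept, is that the mean of `y_train` is never added back to the predictions. The Python sketch below illustrates the idea with a plain closed-form ridge solve standing in for the fracridge call; `ridge_coef` is a hypothetical helper, not part of the package, and the data are synthetic.

```python
import numpy as np

def ridge_coef(X, y, alpha):
    """Closed-form ridge solution (stand-in for the fracridge call)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
feat_train = rng.standard_normal((80, 5))
feat_test = rng.standard_normal((20, 5))
w = rng.standard_normal(5)
# Synthetic scores centered around 30 rather than 0
y_train = 30 + 5 * (feat_train @ w) + rng.standard_normal(80)

# Center features and target using training statistics only
x_mean = feat_train.mean(axis=0)
y_mean = y_train.mean()
beta = ridge_coef(feat_train - x_mean, y_train - y_mean, alpha=1.0)

# Predictions on the centered scale, then shifted back to the original scale
pred_centered = (feat_test - x_mean) @ beta
pred = pred_centered + y_mean
```

The same shift can be applied to predictions obtained from fracridge coefficients, provided the centering statistics come from the training set.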
I'm wondering whether the @ operations in the Python code could be sped up.
X : ndarray, shape (n, p)
Design matrix for regression, with n number of
observations and p number of model parameters.
y : ndarray, shape (n, b)
Data, with n number of observations and b number of targets.
In some cases we have more model parameters than observations (e.g., when using betas to predict some variables).
This insight came from reading:
https://www.benjaminjohnston.com.au/matmul
In these instances, given that scipy.linalg.blas.sgemm is faster with 'f' (Fortran order) than with 'c' (C order),
perhaps we would perform much faster if the "large" dimension were the first one.
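A small timing harness along these lines could help check the idea on a given machine. It is only a sketch: the shapes are made up, and whether either layout wins depends on the BLAS build, so no result is claimed here.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
n, p, b = 200, 2000, 50  # more parameters than observations
X_c = np.ascontiguousarray(rng.standard_normal((n, p)), dtype=np.float32)
X_f = np.asfortranarray(X_c)  # same values, column-major layout
y = rng.standard_normal((n, b)).astype(np.float32)

# Time the same product with the two memory layouts
t_c = timeit.timeit(lambda: X_c.T @ y, number=20)
t_f = timeit.timeit(lambda: X_f.T @ y, number=20)
print(f"C order: {t_c:.4f} s, Fortran order: {t_f:.4f} s")
```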
Is there an easy way of combining fractional RR with efficient leave-one-out cross-validation?
SKLearn has a documented implementation of LOOCV for standard RR:
https://github.com/scikit-learn/scikit-learn/blob/7e1e6d09bcc2eaeba98f7e737aac2ac782f0e5f1/sklearn/linear_model/_ridge.py#L1432
Edit: the linked code should be reviewed in light of this issue:
scikit-learn/scikit-learn#18079
(TL;DR: although the scikit-learn code mentions GCV, an algebraic form of LOOCV is implemented).
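For standard ridge with a fixed alpha and no intercept, the algebraic shortcut divides each in-sample residual by one minus the matching diagonal entry of the hat matrix. The sketch below shows that shortcut in NumPy and checks it against brute-force refitting; `ridge_loo_residuals` is a hypothetical helper, not part of fracridge or scikit-learn. Carrying it over to fractional RR would presumably require holding fixed the alpha that corresponds to each fraction, which is an assumption rather than something the package documents.

```python
import numpy as np

def ridge_loo_residuals(X, y, alpha):
    """Leave-one-out residuals for ridge (no intercept) from a single SVD."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    shrink = s**2 / (s**2 + alpha)                   # per-component shrinkage
    y_hat = U @ (shrink[:, None] * (U.T @ y))        # in-sample predictions
    h_diag = np.einsum("ij,j,ij->i", U, shrink, U)   # diagonal of the hat matrix
    return (y - y_hat) / (1.0 - h_diag)[:, None]

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
y = rng.standard_normal((30, 3))
alpha = 2.0
fast = ridge_loo_residuals(X, y, alpha)

# Brute-force check: refit with each observation held out
slow = np.empty_like(y)
for i in range(X.shape[0]):
    keep = np.arange(X.shape[0]) != i
    gram = X[keep].T @ X[keep] + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, X[keep].T @ y[keep])
    slow[i] = y[i] - X[i] @ coef
```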
Hi :)
I was just wondering if there is a specific reason why the scikit-learn version is pinned to 0.23 rather than the current 1.0?
I'm running into trouble when trying to use the .fit() method of FracRidgeRegressor where x has only one column.
fracs_ = np.arange(0., 1., .05)
fr = FracRidgeRegressor(fracs=fracs_, fit_intercept=True, normalize=True)
# x_train is array of shape (31240, 1)
# y_train is array of shape (31240, 211339)
fr.fit(x_train, y_train)
# *** ValueError: shape-mismatch for sum
I've tried removing the redundant dimension with x_train.squeeze(), which only produced
*** ValueError: Expected 2D array, got 1D array instead:
array=[ 0. 0. 0. ... -0.17669481 -0.05394767
-0.01017657].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Interestingly, I run into the same problem when y has only one target. Using fracridge.fracridge instead of fracridge.FracRidgeRegressor seems to work, though. Likewise, when x or y has more than one column, the .fit method works fine as well. Might this be an issue with the implementation, or am I misunderstanding something?
I'm currently trying this version, pulled from GitHub:
import fracridge
fracridge.__version__
'1.3.2.dev364623090'
Cheers!
Thanks for making this implementation publicly available. It looks like a very promising tool!
I wanted to give it a go on an fMRI dataset which has a large number of targets (i.e., voxels).
I'm getting an error message when trying to fit more than ~280 targets. Here's the error message I get when trying 1000 targets:
train.shape, stimdesign_conv.shape
>> ((284, 1000), (284, 50))
n_alphas = 5
fracs = np.linspace(0,1, n_alphas)
fr = FracRidgeRegressor(
fracs,
fit_intercept=False,
normalize=False,
)
fr = fr.fit(stimdesign_conv, train)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-56-8a37e1c17735> in <module>
6 normalize=False,
7 )
----> 8 fr = fr.fit(stimdesign_conv, train)
~/.conda/envs/thingsmri_env/lib/python3.8/site-packages/fracridge/fracridge.py in fit(self, X, y, sample_weight)
309 X, y, X_offset, y_offset, X_scale = self._validate_input(
310 X, y, sample_weight=sample_weight)
--> 311 coef, alpha = fracridge(X, y, fracs=self.fracs, tol=self.tol,
312 jit=self.jit)
313 self.alpha_ = alpha
~/.conda/envs/thingsmri_env/lib/python3.8/site-packages/fracridge/fracridge.py in fracridge(X, y, fracs, tol, jit)
149
150 # Calculate the rotation of the data
--> 151 selt, v_t, ols_coef = _do_svd(X, y, jit=jit)
152
153 # Set solutions for small eigenvalues to 0 for all targets:
~/.conda/envs/thingsmri_env/lib/python3.8/site-packages/fracridge/fracridge.py in _do_svd(X, y, jit)
62 ynew = uu.T @ y
63
---> 64 ols_coef = (ynew.T / selt).T
65
66 return selt, v_t, ols_coef
ValueError: operands could not be broadcast together with shapes (1000,) (50,)
I'm guessing it could have something to do with fracridge._do_svd changing its behavior depending on the memory demand.
I'm not sure whether it's a bug or whether limiting the number of targets is the only solution.
Thanks in advance!
Can we improve fracridge's documentation to make it more user-friendly?
In particular, perhaps we need an example that shows how multiple targets can be handled. plot_diabetes.html shows an example with just one target; maybe we need an example with multiple targets that also shows how the user can do n-fold cross-validation or something similar.
For that matter, the MATLAB side of things probably needs more examples too.
We currently have this: https://nrdg.github.io/fracridge/
But shouldn't we have something that looks like https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge ?
The docstrings seem more informative than what is available through the web, so maybe we should propagate them?
It appears that on Python 3.9, fracridge requires scikit-learn 0.23.2, but when I try to install scikit-learn 0.23.2, I get error messages (C code compilation errors). (Things seem to work on Python 3.8.)
Perhaps fracridge can be updated to be compatible with the latest scikit-learn (0.24.2?).
The v1.3 docs swapped the "d" and "i", so the package won't install if you copy and paste the command.
pip install fracrdige
should be
pip install fracridge
We encountered an error in the Python version of GLMsingle when analyzing a particular dataset. It turns out that the error was caused by the output of fracridge.py. The following line squeezes the resulting coefficient array:
fracridge/fracridge/fracridge.py
Line 210 in 1ac49ca
There was an edge condition for our fMRI data where the following line
https://github.com/cvnlab/GLMsingle/blob/c9326eefb57a74d49050b2330fd7267b979bc06b/glmsingle/ols/glm_estimatemodel.py#L561
called fracridge with a scalar fracs (a single ridge penalty) and a data2 array with the shape (1,629) (a single voxel x 629 timepoints). The np.squeeze call reduced the coefficient array from a 1 x 629 matrix to a 629-long vector, which is not what glmsingle was expecting, hence throwing an error later.
The following line in glmsingle was assuming that the betas array is either 2D or 3D, where in fact it was 1D:
https://github.com/cvnlab/GLMsingle/blob/d88744be194fdae701764494c663c26e13b5a5ec/glmsingle/ols/glm_estimatemodel.py#L728
In general, I think that a squeeze operation is unsafe in this context: the user might run a script in which pp, ff or bb are sometimes 1. The dimensions of the coefficient matrix shouldn't change as a function of this condition.
The only case in which an np.squeeze call might be safe is when fracs is scalar (not a single-element vector). Then the second dimension can be specifically squeezed by coef.squeeze(1).
EDIT: in the pull request below, I introduced a conditional squeeze of both the target and fracs dimensions that does not include the problematic singleton fracs or targets cases.
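The idea of a conditional squeeze can be sketched as follows. This is not the code from the pull request: `conditional_squeeze` is a hypothetical helper, and it assumes a coefficient array of shape (pp, ff, bb) as described above. An axis is dropped only when the corresponding input was a scalar `fracs` or a 1D `y`, never merely because its length happens to be 1.

```python
import numpy as np

def conditional_squeeze(coef, fracs, y):
    """Drop the fracs/target axes of a (pp, ff, bb) array only when the
    corresponding input was a scalar (fracs) or a 1D vector (y)."""
    squeeze_axes = []
    if np.ndim(fracs) == 0:   # scalar fracs, not a one-element vector
        squeeze_axes.append(1)
    if np.ndim(y) == 1:       # single target passed as a 1D vector
        squeeze_axes.append(2)
    if not squeeze_axes:
        return coef
    return np.squeeze(coef, axis=tuple(squeeze_axes))

coef = np.zeros((629, 1, 1))
y_2d = np.zeros((10, 1))   # one target, but passed as a 2D array
y_1d = np.zeros(10)        # one target passed as a 1D vector

print(conditional_squeeze(coef, 0.5, y_2d).shape)    # fracs axis dropped only
print(conditional_squeeze(coef, [0.5], y_2d).shape)  # nothing dropped
print(conditional_squeeze(coef, 0.5, y_1d).shape)    # both axes dropped
```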
It looks like numba sometimes has trouble crunching through very large data. Might be worth making the jit-compilation optional.
Something like:
def fracridge(X, y, fracs=None, tol=1e-6, jit=True):
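One way such a flag could be wired up is sketched below, under the assumption that numba is an optional dependency. The helper names `_rescale` and `rescale` are hypothetical and the loop is only a placeholder for the real numerical work; the point is the dispatch between a compiled and a plain Python version.

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    njit = None

def _rescale(ynew, selt):
    """Divide each row of ynew by the matching singular value."""
    out = np.empty_like(ynew)
    for i in range(ynew.shape[0]):
        for j in range(ynew.shape[1]):
            out[i, j] = ynew[i, j] / selt[i]
    return out

# Compilation is lazy: nothing is compiled until the first call
_rescale_jit = njit(_rescale) if njit is not None else None

def rescale(ynew, selt, jit=True):
    """Use the compiled version only when requested and available."""
    if jit and _rescale_jit is not None:
        return _rescale_jit(ynew, selt)
    return _rescale(ynew, selt)
```

Calling `rescale(ynew, selt, jit=False)` then bypasses numba entirely, which is the escape hatch the suggestion above asks for.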
Hi :)
I experienced two small errors. The first one is that an import of lapack is missing (https://github.com/nrdg/fracridge/blob/master/fracridge/_linalg.py#L102) and the second one is that the splitting of the too-long line (https://github.com/nrdg/fracridge/blob/master/fracridge/fracridge.py#L39) throws an error.
np.interp requires non-decreasing inputs. If the user provides an input such as [0.1, 0.3, 0.2], it might produce garbage as output. A solution could be to check, sort and warn the user if they provide input that doesn't make sense.
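A check of this kind could look roughly like the sketch below. `check_fracs` is a hypothetical helper, not an existing fracridge function, and the warning text is only illustrative.

```python
import warnings
import numpy as np

def check_fracs(fracs):
    """Return fracs as a sorted 1D array, warning if they arrived unsorted."""
    fracs = np.atleast_1d(np.asarray(fracs, dtype=float))
    if np.any(np.diff(fracs) < 0):
        warnings.warn("fracs were not sorted; sorting them in ascending order.")
        fracs = np.sort(fracs)
    return fracs

sorted_fracs = check_fracs([0.1, 0.3, 0.2])  # warns and returns a sorted copy
```

Note that sorting changes the order of the fractions relative to what the user passed in, so any outputs indexed by fraction would follow the sorted order.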
Hi Ariel - Nice idea.
The Readme page link to the Matlab implementation is a 404.
Onwards,
Brian