
sm00thix / ikpls

Fast CPU and GPU Python implementations of Improved Kernel PLS by Dayal and MacGregor (1997) and Shortcutting Cross-Validation by Engstrøm (2024).

Home Page: https://ikpls.readthedocs.io/en/latest/

License: Apache License 2.0

Languages: Python 97.53%, TeX 2.47%
Topics: data-science, gpu-support, linear-regression, partial-least-squares, partial-least-squares-regression, pls, plsda, plsr, algorithm, tpu-acceleration

ikpls's People

Contributors: parmentelat, sm00thix

ikpls's Issues

License appendix

Hi @parmentelat,

You previously helped me with the LICENSE in relation to your review over at JOSS. I'm wondering if the following part is meant to be deleted, given that I have already followed its instructions. Could you clarify whether or not I should delete this part?

Best,
Ole

APPENDIX: How to apply the Apache License to your work.

  To apply the Apache License to your work, attach the following
  boilerplate notice, with the fields enclosed by brackets "[]"
  replaced with your own identifying information. (Don't include
  the brackets!)  The text should be enclosed in the appropriate
  comment syntax for the file format. We also recommend that a
  file or class name and description of purpose be included on the
  same "printed page" as the copyright notice for easier
  identification within third-party archives.

Issues with GitHub Actions macos-latest (macos-14) runner

Hi @parmentelat,

I noticed a strange issue and would like to get your input on how to address it.

In relation to my #24 (comment), I decided to also upgrade the macOS runners from macos-12 to macos-latest for the GitHub Actions tests.

Annoyingly, testing with macos-latest (macos-14) using GitHub Actions causes the test test_sanity_check_pls_regression_constant_column_Y to fail, as evidenced by this run. The errors are not present on the previous version of the macOS runner, i.e., macos-12 (macos-13 runners do not exist unless you are enterprise), as evidenced by this run. The errors are also not present on either ubuntu-latest or windows-latest. I have a Mac of my own running macOS 14, where I am unable to reproduce the error, and the test passes as it should.

I only have two plausible explanations for this and I am unable to verify either:

  1. The macos-14 runners are running on arm64 processors whereas the macos-12 runners are running on x64 processors. My own Mac is also running on an x64 processor. Is this some freaky edge case where the different processors cause different behavior?
  2. The macos-14 runners may simply be faulty. I have had issues with them previously when they were newer (they are still only a few months old).

Should I write something about this somewhere or just assume that it's an error with the runner? What is best practice here?

tests are failing?

I have tried to run the provided tests on my MacBook Pro.

as touched on in #10, I had to guess my way a little; here's what I did:

pip install -e .
pip install pandas pytest
pytest

the outcome was

=================== 14 failed, 12 passed, 2010 warnings in 1196.90s (0:19:56) ====================

so I reckon there has to be something wrong here

can you please shed light on:

  • quite generally, is the test suite expected to pass? on any hardware?
  • and more specifically, did I prepare things the right way or not?
  • is this result intended because of my hardware?

thanks in advance

Selection of validation indices in fast cross-validation algorithm

Hi @parmentelat,

I have realized that there is a subtle error in my implementation of the fast cross-validation algorithm.
In practice, the error has close to no effect, but it does change the theoretical runtime of the algorithms explained in https://arxiv.org/abs/2401.13185 from Θ(NK(K+M)) to Θ(NK(K+M) + NP), where P is the number of cross-validation splits. The affected algorithms are the ones that compute the training-set-wise X^T * X (algorithm 2 only) and X^T * Y.

The error arises in the selection of validation indices: for each cross-validation split, I construct a boolean mask of length N specifying the validation set, which takes Θ(NP) time across all P splits.
I have implemented a solution that does not construct a boolean mask of length N for each split. Instead, the new solution constructs, for each validation split, a list of length N_val specifying indices into X and Y corresponding to the validation set for the current split. This restores the asymptotic runtime to Θ(NK(K+M)).
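
To sketch the difference (hypothetical variable names; this is an illustration, not the actual implementation):

import numpy as np

# cv_splits assigns each of the N samples to a validation split.
cv_splits = np.array([0, 1, 0, 2, 1, 2])
N = cv_splits.shape[0]

# Old approach: one boolean mask of length N per split, Theta(NP) in total.
masks = {s: cv_splits == s for s in np.unique(cv_splits)}

# New approach: a single pass over cv_splits collects, for each split, the
# list of its validation indices (length N_val), Theta(N) in total. The
# split labels only need to be hashable, and their first-occurrence order
# is preserved.
val_indices = {}
for i, s in enumerate(cv_splits):
    val_indices.setdefault(s, []).append(i)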

In practice, this change has little impact on the runtime, due to the practical constants hidden by the Θ-notation. As a practical test, I re-ran the most extreme example in the benchmarks, that is
time_pls.py -model fastnp2 -n 1000000 -k 500 -m 10 -n_components 30 -n_splits 1000000 -n_jobs -1
and found that the change in runtime is so small that it is not even visible on the graph.

Additionally, a side effect of my proposed solution is that the argument cv_splits to ikpls.fast_cross_validation.numpy_ikpls.PLS.cross_validate can now be any iterable of hashables (due to the internal use of dictionaries) instead of just integer arrays. This allows a user to define their cross-validation splits using, e.g., strings if they like. However, it no longer sorts the validation splits; instead, the resulting metrics will be ordered according to the first occurrence of each unique validation split in cv_splits.
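
To illustrate the new behavior (hypothetical labels; only the cv_splits argument is taken from the description above):

# Any iterable of hashable labels now works as split identifiers, e.g. strings.
cv_splits = ["site_A", "site_B", "site_A", "site_C", "site_B"]
# The resulting metrics follow first-occurrence order: site_A, site_B, site_C.
# (The splits are no longer sorted.)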

I have updated the tests and doc strings to reflect these changes.

I do apologize for any disruption of your work in relation to the review at openjournals/joss-reviews#6533

My questions are:

  1. Will you accept these changes to the code?
  2. Will you accept the timings.png as is or do you require that I rerun the benchmarks related to the fast cross-validation algorithms?

The changes are reflected in 9efdc1a
with a small update in d4b16a6
and 697cc48

I also wonder if adding the option to return the order in which cross-validation was performed would improve usability. Do you have any comments on this?

tests failing

with my repo on 4762f65 I am experiencing one test failure; details below

pytest
=========================================================================== test session starts ============================================================================
platform darwin -- Python 3.12.3, pytest-8.1.1, pluggy-1.4.0
rootdir: /private/tmp/IKPLS
configfile: pyproject.toml
collected 26 items

tests/test_ikpls.py ..........F............... [100%]

===================================== FAILURES ======================================
___________________________ TestClass.test_gradient_pls_1 ___________________________

self = <tests.test_ikpls.TestClass object at 0x156a850d0>

def test_gradient_pls_1(self):
    """
    Description
    -----------
    This test loads input predictor variables and a target variable with a single
    column and calls the 'check_gradient_pls' method to validate the gradient
    propagation for reverse-mode differentiable JAX PLS.

    Returns:
    None
    """
    X = self.load_X()
    Y = self.load_Y(["Protein"])
    num_components = 25
    filter_size = 7
    assert Y.shape[1] == 1
  self.check_gradient_pls(
        X=X,
        Y=Y,
        num_components=num_components,
        filter_size=filter_size,
        val_atol=0,
        val_rtol=1e-5,
        grad_atol=0,
        grad_rtol=1e-5,
    )

tests/test_ikpls.py:2191:


tests/test_ikpls.py:2173: in check_gradient_pls
assert_allclose(output_val_alg_2, output_val_diff_alg_2, atol=0, rtol=2e-11)


args = (<function assert_allclose.<locals>.compare at 0x15a5949a0>, array(12.5455451), array(12.5455451))
kwds = {'equal_nan': True, 'err_msg': '', 'header': 'Not equal to tolerance rtol=2e-11, atol=0', 'verbose': True}

@wraps(func)
def inner(*args, **kwds):
    with self._recreate_cm():
      return func(*args, **kwds)

E AssertionError:
E Not equal to tolerance rtol=2e-11, atol=0
E
E Mismatched elements: 1 / 1 (100%)
E Max absolute difference: 3.54010155e-10
E Max relative difference: 2.8217997e-11
E x: array(12.545545)
E y: array(12.545545)

/Users/tparment/miniconda3/envs/my-ikpls/lib/python3.12/contextlib.py:81: AssertionError
------------------------------- Captured stdout call --------------------------------
stateless_fit for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_preprocess_input_matrices for Improved Kernel PLS Algorithm #1 will be JIT compiled...
get_means for Improved Kernel PLS Algorithm #1 will be JIT compiled...
get_stds for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_get_initial_matrices for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_step_1 for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_main_loop_body for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_step_2 for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_step_3 for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_step_3_body for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_step_4 for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_step_5 for Improved Kernel PLS Algorithm #1 will be JIT compiled...
stateless_predict for Improved Kernel PLS Algorithm #1 will be JIT compiled...
stateless_fit for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_preprocess_input_matrices for Improved Kernel PLS Algorithm #2 will be JIT compiled...
get_means for Improved Kernel PLS Algorithm #2 will be JIT compiled...
get_stds for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_get_initial_matrices for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_step_1 for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_main_loop_body for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_step_2 for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_step_3 for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_step_3_body for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_step_4 for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_step_5 for Improved Kernel PLS Algorithm #2 will be JIT compiled...
stateless_predict for Improved Kernel PLS Algorithm #2 will be JIT compiled...
stateless_fit for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_preprocess_input_matrices for Improved Kernel PLS Algorithm #1 will be JIT compiled...
get_means for Improved Kernel PLS Algorithm #1 will be JIT compiled...
get_stds for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_get_initial_matrices for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_step_1 for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_main_loop_body for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_step_2 for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_step_3 for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_step_3_body for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_step_4 for Improved Kernel PLS Algorithm #1 will be JIT compiled...
_step_5 for Improved Kernel PLS Algorithm #1 will be JIT compiled...
stateless_predict for Improved Kernel PLS Algorithm #1 will be JIT compiled...
stateless_fit for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_preprocess_input_matrices for Improved Kernel PLS Algorithm #2 will be JIT compiled...
get_means for Improved Kernel PLS Algorithm #2 will be JIT compiled...
get_stds for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_get_initial_matrices for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_step_1 for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_main_loop_body for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_step_2 for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_step_3 for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_step_3_body for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_step_4 for Improved Kernel PLS Algorithm #2 will be JIT compiled...
_step_5 for Improved Kernel PLS Algorithm #2 will be JIT compiled...
stateless_predict for Improved Kernel PLS Algorithm #2 will be JIT compiled...
============================== short test summary info ==============================
FAILED tests/test_ikpls.py::TestClass::test_gradient_pls_1 - AssertionError:
===================== 1 failed, 25 passed in 1403.61s (0:23:23) =====================

Changes to paper.md introduction and first section in README

Quoting @basileMarchand from the review at JOSS:

In the JOSS paper, it would be good in my opinion to have a more "simplified" introduction in the summary so that a less experienced reader can easily understand what need the package is addressing. Also, the first paragraph of the README (which is also the first paragraph of the documentation) should be made more engaging to catch the attention of readers who might be searching without really knowing what they need. This would help attract more users and build a more diverse community.

Support `numpy=2.0.1`?

Hello, thank you for maintaining this useful project! Are there any plans to support numpy=2.0.1 in the near future?

benchmarking and reproducibility

the paper comes with a very nice figure comparing the performance of various setups and algorithms
also, the sources come with several files whose names suggest they are useful for reproducing these results

  • ./time_pls.py
  • ./plot_timings.py
  • as well as the whole ./timings folder

my question is: are there any plans to document how to use this material if one wanted to reproduce the results in the paper?
(or is that documented already and I missed it?)

Clarify installation instructions for GPU usage

Hello again !

In the installation section, it is mentioned that running pip install ikpls is sufficient, which is true for CPU usage.
However, for GPU usage, additional work is required, particularly for the installation of JAX. A user might mistakenly believe that running pip install ikpls on a computer with a GPU would automatically configure everything for GPU usage, which is not the case.
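
For context, the extra step might look roughly like this (a sketch assuming a CUDA 12 machine; the exact command varies between JAX releases, so the official JAX installation guide is the authoritative reference):

pip install ikpls
pip install -U "jax[cuda12]"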

Could you please add a warning in the documentation to inform users that, if they want to use the GPU capabilities, they need to install JAX separately? Ideally, also provide some guidance on how to perform this installation.

Thank you !

For review openjournals/joss-reviews#6533

reproducibility and estimated measurements

a final word about the paper's reproducibility

some values in the figure appear with a circle; about these, the legend states:

A square means that the experiment was run until the time per iteration had stabilized and used to forecast the time usage if the experiment was run to completion

can you please comment further on the practical means, if any, used to achieve that, and on possible ways to automate that process?
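
For what it is worth, here is a sketch of one way such a forecast could be automated (hypothetical; the paper does not spell out the exact stabilization criterion):

import numpy as np

def forecast_total_time(iter_times, total_iters, window=10, rel_tol=0.02):
    # iter_times: wall-clock times of the iterations measured so far.
    # Returns a forecast of the total runtime once the last `window`
    # measurements vary by at most rel_tol around their mean; returns
    # None while the per-iteration time has not yet stabilized.
    recent = np.asarray(iter_times[-window:], dtype=float)
    if recent.shape[0] < window:
        return None
    mean = recent.mean()
    if np.abs(recent - mean).max() > rel_tol * mean:
        return None
    return mean * total_iters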

contributing doc - redux & how to rebuild the doc

I have had a second go at the contributing documentation, now that it is more accessible thanks to #10

turns out some of my initial concerns are still to be addressed

so I have restarted from scratch, put myself in a new potential contributor's shoes, and came across a few further glitches

  1. the doc itself
  • the doc still does not address running the tests, which is one of the first things one would want to do
  • it does not either address rebuilding the doc itself
  • it could profitably mention the creation of a virtual env
  2. more importantly, I could not find a way to rebuild the doc locally (more on this below)

about 1.: to illustrate all this, you can take a look at this page:
https://github.com/parmentelat/IKPLS/blob/contrib-doc/CONTRIBUTING.md

note: I was about to create a PR from that material, but refrained from doing so because I can neither read nor write nor preview reStructuredText, so I've had to move to markdown, which is maybe not what you'll want to do... regardless, the content is what matters here anyway

about 2.: for rebuilding the doc, I had to mess with docs/Makefile as shown here:
parmentelat@3f9a832
as I can't access the builds page on readthedocs, I can't say whether the builds pass over there, but it feels unlikely?

pylint-friendliness & very long lines

Hey there

as part of the review initiated in openjournals/joss-reviews#6533, I am taking a first stab at this repo :)

a very first and general comment I have about the Python code relates to its pylint assessment:

  • particularly, the "line-too-long" messages hinder smooth reading (I take it your editor wraps long lines? I have seen some of them reaching 350+ characters, which is not good practice!)
  • and more generally, a rather poor pylint score, which gives me
    • a score of 0/10 on the ikpls folder
    • a score of 5.14/10 on the examples folder

given the heavy use of 'math-like' variables, I guess it is acceptable to allow the invalid-name issues
(for example, if I allow that on one source file (ikpls/jax_ikpls_alg_2.py), its score rises from 0/10 to 4.15/10)

so I guess my proposal would be

  • to achieve a better pylint score in a best-effort way
  • by at least sticking to a strict 88-column max line width (see the example command after this list)
  • and solving - or at least starting to solve - the other issues, apart from the "invalid-name" ones, which are endemic but acceptable given the variable naming scheme
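
For instance, a best-effort local check could use standard pylint options (a sketch, not a finalized configuration):

pylint --max-line-length=88 --disable=invalid-name ikpls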

Distinguish between Test and Packaging in GitHub Actions and add separate badges

Hello

IKPLS uses GitHub Actions to automate its testing, packaging, and deployment processes.
However, in the GitHub Actions interface, it is difficult to access the test results because they are not clearly distinguished from the packaging part; in addition, the two workflows are both named "Python Package". Additionally, the project's README has a Packaging badge but no Test badge.

Could you please:

  • Clearly distinguish between test and packaging steps in the GitHub Actions workflow.
  • Add separate badges for Test and Packaging in the README (see the sketch below).
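
For illustration, once the workflows are split, the README badges could look like the following (the workflow file names test.yml and package.yml are hypothetical):

[![Test](https://github.com/sm00thix/ikpls/actions/workflows/test.yml/badge.svg)](https://github.com/sm00thix/ikpls/actions/workflows/test.yml)
[![Package](https://github.com/sm00thix/ikpls/actions/workflows/package.yml/badge.svg)](https://github.com/sm00thix/ikpls/actions/workflows/package.yml)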

Thank you!

For review openjournals/joss-reviews#6533

contributing documentation improvements

here are a few ideas for possible improvements on the contributing documentation

  • first off, unless I missed it, this is not part of the published documentation on readthedocs; was that intentional?
  • second, it would help to give a copy-able fragment of the useful commands for setting up a development environment (see the sketch after this list); given that you mention poetry in pyproject.toml, it feels like this doc could fruitfully give more insight into how to use poetry in this context (indeed, poetry is less well-known than git and GitHub, so detailed instructions for it are all the more useful)
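
for instance, a minimal copy-able sequence might look like this (a sketch assuming poetry is already installed):

git clone https://github.com/sm00thix/ikpls.git
cd ikpls
poetry install
poetry run pytest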
