radlfabs / flexcv Goto Github PK

Python package customizing nested cross validation for tabular data.

Home Page: https://radlfabs.github.io/flexcv/

License: MIT License

Python 100.00%

cross-validation grouped-datasets machine-learning mixed-effects mixed-effects-models neptune-ai nested-cross-validation regression regression- regression-models

flexcv's Issues

PyPI Release not triggered

Currently, the PyPI workflow is not triggered properly by tagged commits.
For example, 4e13ecb was tagged, but in https://github.com/radlfabs/flexcv/actions/runs/7628324895 the PyPI release was skipped.

Add user-guide & API reference

Our docs need a detailed user guide and a reference of the objects.

yaml NameError

from flexcv import CrossValidation
Traceback (most recent call last):
  File "", line 1, in <module>
  File "D:\GitHub\flexcv-test\dev\Lib\site-packages\flexcv\__init__.py", line 2, in <module>
    from .interface import CrossValidation
  File "D:\GitHub\flexcv-test\dev\Lib\site-packages\flexcv\interface.py", line 28, in <module>
    from .yaml_parser import read_mapping_from_yaml_file, read_mapping_from_yaml_string
  File "D:\GitHub\flexcv-test\dev\Lib\site-packages\flexcv\yaml_parser.py", line 11, in <module>
    loader: yaml.SafeLoader, node: yaml.nodes.ScalarNode
    ^^^^
NameError: name 'yaml' is not defined

Update Getting Started

As soon as the interface/API is finalized, we need a great Getting Started section in both the docs and the README.

Rename KFold Classes

Rename the custom KFold classes meaningfully.
It should be clear what they do and what their pros and cons are.
Update the docstrings where needed.

Update Docs

The installation guide in the docs must address the pip procedure once the package is on PyPI.

Bug: pformat_dict not imported from utilities

from flexcv import CrossValidation
from flexcv.synthesizer import generate_regression
X, y, groups, _ = generate_regression(12, 300, 10)
from flexcv.models import LinearModel

cv = CrossValidation()

performed = (
    cv
    .set_data(X, y, groups)
    .set_splits()
    .add_model(LinearModel)
    .perform()
    .get_results()
)

raises "NameError: name 'pformat_dict' is not defined".

Problem is here:

flexcv/flexcv/results_handling.py

Line 79 in e260729

return f"CrossValidationResults {pformat_dict(self)}"

We need to import pformat_dict from flexcv.utilities.

Python 3.12

Add compatibility for Python 3.12.
Currently awaiting neptune support (issue), which relies on Numba (issue).

yaml parser missing in Refs

The yaml parser is not present in the reference/docs.

BUG: ImportError in neptune-sklearn's dep

Importing the sklearn integration for neptune fails at the moment due to reiinakano/scikit-plot#119.
Reported also on the neptune repo.

Build Package

The Python package has to be built when everything is ready.

Generate binaries/archive with build
Test if the archive has all files

Reduce model-loop: minimize LMER/MERF dependencies in cross_validate

Currently the cross validation loop iterates over base estimators.
A flag switches into a second code block inside the loop, where the mixed effects modeling takes place.
Therefore, LMER and MERF have to be treated as a separate model type, unlike any other base estimator.
This is also due to the differences in API.
It would be beneficial to reduce this second part of the loop. MERF will always be in a different hierarchy because it is optimizing on top of base estimators. Unfortunately, LMER cannot be a base model inside MERF either, because MERF only supports fixed effects fitting internally for the base model.
A possible solution would be to add a flag "fit_merf" to the ModelConfigDict. After the base estimator is fitted, this flag would trigger the part of the loop that performs MERF fitting and evaluation.
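
The proposed flag could look roughly like the following minimal sketch. The helper functions `fit_base` and `fit_merf_on_top` are hypothetical placeholders, not flexcv functions, and the config is a plain dict standing in for ModelConfigDict.

```python
from typing import Any, Dict, List


def fit_base(name: str) -> str:
    # Hypothetical placeholder for fitting and evaluating a base estimator.
    return f"{name}:base"


def fit_merf_on_top(name: str) -> str:
    # Hypothetical placeholder for fitting MERF on top of the base estimator.
    return f"{name}:merf"


def run_models(mapping: Dict[str, Dict[str, Any]]) -> List[str]:
    """Fit every base estimator; fit MERF only when the config asks for it."""
    fitted = []
    for name, config in mapping.items():
        fitted.append(fit_base(name))
        # A missing flag counts as False, so existing configs keep working.
        if config.get("fit_merf", False):
            fitted.append(fit_merf_on_top(name))
    return fitted


mapping = {
    "LinearModel": {"fit_merf": False},
    "RandomForest": {"fit_merf": True},
}
print(run_models(mapping))
```

With this shape, the loop body stays the same for all base estimators and the MERF step becomes a single optional branch.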

Fit_Kwargs to allow custom callbacks

We need to support custom callbacks so that people can use, e.g., the neptune-xgboost integration.

For example, the XGBRegressor with the sklearn interface has the following keyword argument in .fit():

callbacks: Optional[Sequence[TrainingCallback]] = None,

Which is defined as

callbacks (Optional[List[TrainingCallback]]) –

List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using Callback API.

A callback where a run object is passed (e.g. in the neptune-xgboost integration) would only work without dummy instantiation.
Therefore, we could create a dict in the mappings in our interface as follows

model_callback = CustomCallback(run)
CrossValidation.config["mapping"]["model_name"]["fit_kwargs"] = {"callbacks": [model_callback]} if model_callback is not None else {}

Then in core.cross_validate we could instantiate the fit_kwargs dict by indexing the model mapping

fit_kwargs = mapping[model_name]["fit_kwargs"]

Which then will be unpacked to the fit call.
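
A minimal sketch of that unpacking step, assuming the mapping is a plain nested dict. `DummyModel`, `fit_with_mapping`, and `log_iteration` are hypothetical names used only for illustration.

```python
class DummyModel:
    """Hypothetical estimator that records the keyword arguments passed to fit."""

    def __init__(self):
        self.received = None

    def fit(self, X, y, **fit_kwargs):
        self.received = fit_kwargs
        return self


def fit_with_mapping(model, X, y, mapping, model_name):
    # Fall back to an empty dict so models without fit_kwargs still work.
    fit_kwargs = mapping[model_name].get("fit_kwargs", {})
    return model.fit(X, y, **fit_kwargs)


def log_iteration(*args, **kwargs):
    # Stand-in for a real callback such as one from the neptune-xgboost integration.
    pass


mapping = {
    "model_name": {"fit_kwargs": {"callbacks": [log_iteration]}},
    "plain_model": {},
}
X, y = [[0.0], [1.0]], [0.0, 1.0]
with_callbacks = fit_with_mapping(DummyModel(), X, y, mapping, "model_name")
without_callbacks = fit_with_mapping(DummyModel(), X, y, mapping, "plain_model")
```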

Make test cases

to check if the interface assignment works
a mock callback was called.
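
One way to write the second test case is with `unittest.mock.Mock` from the standard library. This is a sketch only: `CallbackModel` is a hypothetical estimator that calls each callback once per iteration, not a flexcv or XGBoost class.

```python
from unittest.mock import Mock


class CallbackModel:
    """Hypothetical estimator that invokes each callback once per iteration."""

    def __init__(self, n_iterations=3):
        self.n_iterations = n_iterations

    def fit(self, X, y, callbacks=None):
        for iteration in range(self.n_iterations):
            for callback in callbacks or []:
                callback(iteration)
        return self


def test_mock_callback_was_called():
    callback = Mock()
    fit_kwargs = {"callbacks": [callback]}
    CallbackModel(n_iterations=3).fit([[0.0]], [0.0], **fit_kwargs)
    # The mock records every call, so both count and last argument can be checked.
    assert callback.call_count == 3
    callback.assert_called_with(2)


test_mock_callback_was_called()
```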

Complete Testsuite

unit test core and utility modules
rather complex pytest runs and combinations of models
comparison tests to sklearn wherever possible

Update Tests for cross validate and its classes

After restructuring the cross validation loop we need some more tests for the

SingleModelFoldResult class and its make_results method
PostProcessing classes
for some of the interface functionality
for the new defaults in model mapping
for the model mapping -> interface interaction
for the "add_merf" keyword
for the kwarg handling in the interface

Testing Suite takes too long

Due to the iterative nature of the processes and evaluations in flexcv, the testing suite takes too long, e.g. > 4 hrs.
The goal is to reduce testing time as much as possible.

Create DOI

A citable DOI has to be generated.

Make a GitHub release
go to linked Zenodo Account and create DOI
add badge link to the readme and update

YAML Parser for Model Mapping

For reuse of model configurations it would be nice to work with yaml files.
Therefore, a parser is needed. Main features:

translate yaml ModelMappingDicts
create inner dicts as ModelConfigDicts
instantiate the parameter distributions for the models
make imports of model classes and post_processor classes

This will make it easier to reuse larger configurations.
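
The import step could be handled with `importlib`, as in this minimal sketch. It assumes the parsed YAML yields a dict in which classes are given as dotted path strings under the keys "model" and "post_processor"; that layout and the function names are assumptions, not the flexcv format. A standard library class stands in for a model class.

```python
import collections
import importlib


def import_from_path(dotted_path: str):
    """Import an object given as 'package.module.ClassName'."""
    module_name, _, attribute = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected a dotted path, got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


def resolve_model_config(raw_config: dict) -> dict:
    """Replace dotted path strings with the imported classes."""
    resolved = dict(raw_config)
    for key in ("model", "post_processor"):  # assumed key names
        if isinstance(resolved.get(key), str):
            resolved[key] = import_from_path(resolved[key])
    return resolved


# collections.OrderedDict is only a stand-in for a real model class.
raw = {"model": "collections.OrderedDict", "n_trials": 10}
config = resolve_model_config(raw)
```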

Add make_model_mapping function

When first starting with the CrossValidation object the user needs to instantiate a ModelMappingDict which uses ModelConfigDict.
This might seem unintuitive for users coming from sklearn. Users also have to know that they need to import and use these classes.

Possible solution:

implement a function that helps making this mapping with an easy name
check if this maker function could also be used in method chaining on CrossValidation
CrossValidation could also implement a check if the dict is already an instance of ModelMappingDict, so the user does not have to call it before.
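
A minimal sketch of such a maker function, including the instance check. The two classes here are simplified dict subclasses standing in for flexcv's ModelMappingDict and ModelConfigDict, whose real definitions are not shown in this issue.

```python
class ModelConfigDict(dict):
    """Simplified stand-in for flexcv's ModelConfigDict."""


class ModelMappingDict(dict):
    """Simplified stand-in for flexcv's ModelMappingDict."""


def make_model_mapping(mapping):
    """Wrap a plain nested dict; return existing ModelMappingDicts unchanged."""
    if isinstance(mapping, ModelMappingDict):
        return mapping
    return ModelMappingDict(
        {
            name: config if isinstance(config, ModelConfigDict) else ModelConfigDict(config)
            for name, config in mapping.items()
        }
    )


plain = {"LinearModel": {"requires_inner_cv": False}}  # key name is illustrative
wrapped = make_model_mapping(plain)
```

Because the function returns an existing ModelMappingDict unchanged, it would be safe to call it on every user input.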

Allow passing KFold Iterator directly in the CrossValidation class

At the moment the CrossValidation interface takes either strings or proprietary enums that are parsed as split methods in the process.
It would be beneficial to allow passing cross validators directly, so users could easily write and integrate their own class or simply pass an sklearn iterator.
Currently the parsing is done by flexcv.split.make_cross_val_split. This could check for a Callable Cross Val Iterator and return its split method if True.
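
The check could be done by duck typing on a callable `split` attribute, as in this sketch. `resolve_split` and `HalvesSplitter` are hypothetical and do not mirror the real signature of make_cross_val_split. The splitter follows the sklearn convention of `split(X, y, groups)` yielding train and test indices.

```python
def resolve_split(method, named_methods):
    """Return a split callable from a cross validator object or a known name."""
    split = getattr(method, "split", None)
    if callable(split):
        # Any object with a callable split method is accepted as-is.
        return split
    if isinstance(method, str) and method in named_methods:
        return named_methods[method]
    raise TypeError(f"Unsupported split method: {method!r}")


class HalvesSplitter:
    """Hypothetical user-defined cross validator with two folds."""

    def split(self, X, y=None, groups=None):
        n = len(X)
        half = n // 2
        first, second = list(range(half)), list(range(half, n))
        yield second, first
        yield first, second


split = resolve_split(HalvesSplitter(), named_methods={})
folds = list(split([10, 20, 30, 40]))
```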

README code fails

The README has some bugs in the example code:

An unclosed parenthesis here
And a missing import statement of the module
Alternatively, the line

flexcv/README.md

Line 85 in 004d0b1

method_outer_split=flexcv.CrossValMethod.GROUP

could be changed to method_outer_split="GroupKFold" in order to avoid the module-level import.

Bug: Improper display of type annotations in reference.md

In reference.md some type annotations are parsed irregularly. This seems to be module related.
For some modules the types are not recognized at all and are parsed as part of the descriptions, i.e. displayed in the wrong column.
This works in interface.py, but not in e.g. results_handling.py.
Correct:

Incorrect:

Move Earth Wrapper Class & rpy2 dependency

rpy2 is raising errors from CFFI. This does not harm the computational side at the moment but results in annoying pop-ups.

Related issues in rpy2: rpy2/rpy2#1063
This may be fixed in rpy2/rpy2#1020 but remains unresolved to date.

Therefore I will move the rpy2 part regarding the earth regressor to flexcv-earth.

Upload package

Upload to Test PyPI
Test the package
Upload to PYPI
Release to Github

Bug: RF is non deterministic

Random Forest does not seem to be deterministic in my test case with a simple kfold/kfold nested cross validation and only tuning max_depth in Optuna.

Allow Classification

Currently flexcv is based around regression.
Let's check what has to be done in order to allow classifiers as well.

Metrics: test with the battery of Acc/Recall/F1 etc.
Any new args/kwargs for the models?
Test with other split methods for classes
allow categorical transformers

Tests for our Docs

Tests for the docs are missing!

Would be nice to check code blocks.
Check markdown syntax
Does it build correctly?

Update docs after changes to interface

In #13 we take some decisive steps by generalizing the interface and reducing model dependencies inside core functions. The docs need to be updated regarding

CrossValidation.set_merf
CrossValidation.set_lmer
ModelConfigDict -> what is necessary to set here? Which individual values will override global values?
Update reference for PostProcessors.

Interface: Better Return Type Hints

Add better return type hints to the interface methods in order to improve the IDE hints available when coding.
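
For chained methods, annotating the return type with the class itself lets an IDE offer completions after every call. The sketch below uses a simplified stand-in class, not the real CrossValidation. The method names follow the usage shown in the issues above, but the parameters and the `get_config` method are assumptions.

```python
from typing import Any, Dict, Optional, Sequence


class CrossValidationSketch:
    """Simplified stand-in that only demonstrates the annotations."""

    def __init__(self) -> None:
        self.config: Dict[str, Any] = {}

    def set_data(
        self, X: Sequence[Any], y: Sequence[Any], groups: Optional[Sequence[Any]] = None
    ) -> "CrossValidationSketch":
        self.config["n_samples"] = len(X)
        self.config["has_groups"] = groups is not None
        return self

    def set_splits(self, n_splits: int = 5) -> "CrossValidationSketch":
        # n_splits is an assumed parameter for illustration.
        self.config["n_splits"] = n_splits
        return self

    def get_config(self) -> Dict[str, Any]:
        return dict(self.config)


config = (
    CrossValidationSketch()
    .set_data([[0.0], [1.0]], [0.0, 1.0])
    .set_splits(n_splits=2)
    .get_config()
)
```

On Python 3.11 and later, `typing.Self` is an alternative to the quoted class name and also works correctly for subclasses.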
