
tscv's People

Contributors

cclauss, lawsonmcw, wenjiez


tscv's Issues

Does this work with sklearn 1.2?

There seem to be some changes in the scikit-learn library that cause compatibility issues with tscv. I'm just wondering whether I'm doing something wrong or whether tscv doesn't currently support sklearn 1.2.

GapKFold CV not working with sklearn cross_val_predict

Versions:

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.24.0

tscv                      0.1.3              pyhd8ed1ab_0    conda-forge
scikit-learn              1.5.0           py311hdcb8d17_1    conda-forge
pandas                    2.2.2           py311hcf9f919_1    conda-forge
numpy     : 1.26.4

Minimal reproducible code:

from tscv import GapKFold, GapLeavePOut, GapRollForward
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

# generate data
rng = np.random.default_rng(0)
size = 10
X = pd.DataFrame({"f1": rng.standard_normal(size), "f2": rng.standard_normal(size)})
y = pd.DataFrame(rng.integers(low=0, high=2, size=size), columns=["target"])
X, y

# setup cv
cv = GapKFold(n_splits=3, gap_before=1, gap_after=1)
for train, test in cv.split(X):
    print("train:", train, "test:", test)
    print("train:", len(train), "test:", len(test))

# setup pipeline and predict
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=10))
preds = cross_val_predict(pipe, X, y, cv=cv, n_jobs=-1, method='predict_proba')

# error 
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\kngka\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\joblib\externals\loky\process_executor.py", line 463, in _process_worker
    r = call_item()
        ^^^^^^^^^^^
  File "C:\Users\kngka\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\joblib\externals\loky\process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kngka\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\joblib\parallel.py", line 598, in __call__
    return [func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kngka\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\joblib\parallel.py", line 598, in <listcomp>
    return [func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kngka\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\sklearn\utils\parallel.py", line 129, in __call__
    return self.function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kngka\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\sklearn\model_selection\_validation.py", line 1390, in _fit_and_predict
    predictions = _enforce_prediction_order(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kngka\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\sklearn\model_selection\_validation.py", line 1457, in _enforce_prediction_order
    predictions_for_all_classes[:, classes] = predictions
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
IndexError: index 1 is out of bounds for axis 1 with size 1
"""

The above exception was the direct cause of the following exception:

IndexError                                Traceback (most recent call last)
Cell In[18], line 2
      1 pipe = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=10))
----> 2 preds = cross_val_predict(pipe, X, y, cv=cv, n_jobs=-1, method='predict_proba')

File ~\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\sklearn\utils\_param_validation.py:213, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    207 try:
    208     with config_context(
    209         skip_parameter_validation=(
    210             prefer_skip_nested_validation or global_skip_validation
    211         )
    212     ):
--> 213         return func(*args, **kwargs)
    214 except InvalidParameterError as e:
    215     # When the function is just a wrapper around an estimator, we allow
    216     # the function to delegate validation to the estimator, but we replace
    217     # the name of the estimator by the name of the function in the error
    218     # message to avoid confusion.
    219     msg = re.sub(
    220         r"parameter of \w+ must be",
    221         f"parameter of {func.__qualname__} must be",
    222         str(e),
    223     )

File ~\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\sklearn\model_selection\_validation.py:1282, in cross_val_predict(estimator, X, y, groups, cv, n_jobs, verbose, fit_params, params, pre_dispatch, method)
   1279 # We clone the estimator to make sure that all the folds are
   1280 # independent, and that it is pickle-able.
   1281 parallel = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch)
-> 1282 predictions = parallel(
   1283     delayed(_fit_and_predict)(
   1284         clone(estimator),
   1285         X,
   1286         y,
   1287         train,
   1288         test,
   1289         routed_params.estimator.fit,
   1290         method,
   1291     )
   1292     for train, test in splits
   1293 )
   1295 inv_test_indices = np.empty(len(test_indices), dtype=int)
   1296 inv_test_indices[test_indices] = np.arange(len(test_indices))

File ~\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\sklearn\utils\parallel.py:67, in Parallel.__call__(self, iterable)
     62 config = get_config()
     63 iterable_with_config = (
     64     (_with_config(delayed_func, config), args, kwargs)
     65     for delayed_func, args, kwargs in iterable
     66 )
---> 67 return super().__call__(iterable_with_config)

File ~\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\joblib\parallel.py:2007, in Parallel.__call__(self, iterable)
   2001 # The first item from the output is blank, but it makes the interpreter
   2002 # progress until it enters the Try/Except block of the generator and
   2003 # reaches the first `yield` statement. This starts the asynchronous
   2004 # dispatch of the tasks to the workers.
   2005 next(output)
-> 2007 return output if self.return_generator else list(output)

File ~\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\joblib\parallel.py:1650, in Parallel._get_outputs(self, iterator, pre_dispatch)
   1647     yield
   1649     with self._backend.retrieval_context():
-> 1650         yield from self._retrieve()
   1652 except GeneratorExit:
   1653     # The generator has been garbage collected before being fully
   1654     # consumed. This aborts the remaining tasks if possible and warn
   1655     # the user if necessary.
   1656     self._exception = True

File ~\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\joblib\parallel.py:1754, in Parallel._retrieve(self)
   1747 while self._wait_retrieval():
   1748 
   1749     # If the callback thread of a worker has signaled that its task
   1750     # triggered an exception, or if the retrieval loop has raised an
   1751     # exception (e.g. `GeneratorExit`), exit the loop and surface the
   1752     # worker traceback.
   1753     if self._aborting:
-> 1754         self._raise_error_fast()
   1755         break
   1757     # If the next job is not ready for retrieval yet, we just wait for
   1758     # async callbacks to progress.

File ~\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\joblib\parallel.py:1789, in Parallel._raise_error_fast(self)
   1785 # If this error job exists, immediately raise the error by
   1786 # calling get_result. This job might not exists if abort has been
   1787 # called directly or if the generator is gc'ed.
   1788 if error_job is not None:
-> 1789     error_job.get_result(self.timeout)

File ~\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\joblib\parallel.py:745, in BatchCompletionCallBack.get_result(self, timeout)
    739 backend = self.parallel._backend
    741 if backend.supports_retrieve_callback:
    742     # We assume that the result has already been retrieved by the
    743     # callback thread, and is stored internally. It's just waiting to
    744     # be returned.
--> 745     return self._return_or_raise()
    747 # For other backends, the main thread needs to run the retrieval step.
    748 try:

File ~\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\joblib\parallel.py:763, in BatchCompletionCallBack._return_or_raise(self)
    761 try:
    762     if self.status == TASK_ERROR:
--> 763         raise self._result
    764     return self._result
    765 finally:

IndexError: index 1 is out of bounds for axis 1 with size 1
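
A likely cause, judging from the traceback rather than anything confirmed in the issue: with only 10 samples, a binary target, and gaps removing rows around each test block, some training splits can end up containing a single class. predict_proba then returns one column, and scikit-learn's padding step (_enforce_prediction_order) fails with the IndexError above. A quick diagnostic sketch, reusing X, y, and cv from the reproduction code:

# Diagnostic sketch (assumption: some folds train on a single class).
import numpy as np

for i, (train, test) in enumerate(cv.split(X)):
    classes_in_train = np.unique(y.iloc[train])
    print(f"fold {i}: classes in training part: {classes_in_train}")

If any fold reports a single class, enlarging the dataset or shrinking the gaps should make the error disappear.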

Error when importing tscv GapWalkForward

I have been using tscv's GapWalkForward successfully with Python 3.7.

Suddenly I am getting the following error:

ImportError                               Traceback (most recent call last)
in <module>
     41 #Modeling
     42
---> 43 from tscv import GapWalkForward
     44 from sklearn.utils import shuffle
     45 from sklearn.model_selection import KFold

~\Anaconda3\envs\py37\lib\site-packages\tscv\__init__.py in <module>
----> 1 from .split import GapCrossValidator
      2 from .split import GapLeavePOut
      3 from .split import GapKFold
      4 from .split import GapWalkForward
      5 from .split import gap_train_test_split

~\Anaconda3\envs\py37\lib\site-packages\tscv\split.py in <module>
      7
      8 import numpy as np
----> 9 from sklearn.utils import indexable, safe_indexing
     10 from sklearn.utils.validation import _num_samples
     11 from sklearn.base import _pprint

ImportError: cannot import name 'safe_indexing' from 'sklearn.utils'

Any insight? I get this when simply importing GapWalkForward.
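
A minimal stopgap, assuming the only problem is the renamed helper (safe_indexing became the private _safe_indexing in scikit-learn 0.24): alias the old name back before importing tscv. Upgrading tscv, or pinning scikit-learn below 0.24, is the proper fix; the shim below is only a sketch.

# Stopgap sketch: restore the old public name so an older tscv release can import.
# Assumption: sklearn.utils._safe_indexing exists (true for scikit-learn >= 0.24).
import sklearn.utils

if not hasattr(sklearn.utils, "safe_indexing"):
    sklearn.utils.safe_indexing = sklearn.utils._safe_indexing

from tscv import GapWalkForward  # may now import without the ImportError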

Import error with latest sklearn version

Hi guys, this issue occurred after upgrading to scikit-learn 1.1.3:

ImportError: cannot import name '_pprint' from 'sklearn.base'

/.venv/lib/python3.10/site-packages/tscv/_split.py:19 in <module>

   16 import numpy as np
   17 from sklearn.utils import indexable
   18 from sklearn.utils.validation import _num_samples, check_consistent_length
❱  19 from sklearn.base import _pprint
   20 from sklearn.utils import _safe_indexing

Could you please fix it?

Kind regards,
Jim
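
For what it's worth, later reports on this page show tscv 0.1.3 importing cleanly alongside scikit-learn 1.5.0, so upgrading tscv is the likely fix; rolling scikit-learn back to the version installed before the upgrade is the stopgap. A quick check of what is actually installed (a sketch; it reads distribution metadata because importing tscv itself fails here):

# Sketch: confirm installed versions without importing tscv (the import fails).
from importlib.metadata import version

print("scikit-learn:", version("scikit-learn"))
print("tscv:", version("tscv"))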

GapLeavePOut CV not working with sklearn cross_val_predict

Versions:

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.24.0

tscv                      0.1.3              pyhd8ed1ab_0    conda-forge
scikit-learn              1.5.0           py311hdcb8d17_1    conda-forge
pandas                    2.2.2           py311hcf9f919_1    conda-forge
numpy     : 1.26.4

Minimal reproducible code:

from tscv import GapKFold, GapLeavePOut, GapRollForward
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

# generate data
rng = np.random.default_rng(0)
size = 10
X = pd.DataFrame({"f1": rng.standard_normal(size), "f2": rng.standard_normal(size)})
y = pd.DataFrame(rng.integers(low=0, high=2, size=size), columns=["target"])
X, y

# setup cv
cv = GapLeavePOut(p=3, gap_before=1, gap_after=1)
for train, test in cv.split(X):
    print("train:", train, "test:", test)
    print("train:", len(train), "test:", len(test))

# setup pipeline and predict
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=10))
preds = cross_val_predict(pipe, X, y, cv=cv, n_jobs=-1, method='predict_proba')

# error
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[20], line 2
      1 pipe = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=10))
----> 2 preds = cross_val_predict(pipe, X, y, cv=cv, n_jobs=-1, method='predict_proba')

File ~\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\sklearn\utils\_param_validation.py:213, in validate_params.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    207 try:
    208     with config_context(
    209         skip_parameter_validation=(
    210             prefer_skip_nested_validation or global_skip_validation
    211         )
    212     ):
--> 213         return func(*args, **kwargs)
    214 except InvalidParameterError as e:
    215     # When the function is just a wrapper around an estimator, we allow
    216     # the function to delegate validation to the estimator, but we replace
    217     # the name of the estimator by the name of the function in the error
    218     # message to avoid confusion.
    219     msg = re.sub(
    220         r"parameter of \w+ must be",
    221         f"parameter of {func.__qualname__} must be",
    222         str(e),
    223     )

File ~\miniconda3\envs\quantinsti_mlt04\Lib\site-packages\sklearn\model_selection\_validation.py:1260, in cross_val_predict(estimator, X, y, groups, cv, n_jobs, verbose, fit_params, params, pre_dispatch, method)
   1258 test_indices = np.concatenate([test for _, test in splits])
   1259 if not _check_is_permutation(test_indices, _num_samples(X)):
-> 1260     raise ValueError("cross_val_predict only works for partitions")
   1262 # If classification methods produce multiple columns of output,
   1263 # we need to manually encode classes to ensure consistent column ordering.
   1264 encode = (
   1265     method in ["decision_function", "predict_proba", "predict_log_proba"]
   1266     and y is not None
   1267 )

ValueError: cross_val_predict only works for partitions
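
This error looks expected rather than a bug, as far as I can tell: cross_val_predict requires the test folds to form a partition of the rows (each row appears in exactly one test fold), while GapLeavePOut produces overlapping test windows. A manual sketch that collects per-fold probabilities instead, reusing pipe, cv, X, and y from the reproduction code above:

# Sketch: gather out-of-fold probabilities by hand, since cross_val_predict
# refuses overlapping test folds. Rows covered by several folds get averaged.
# Assumption: every training fold contains both classes.
import numpy as np
from sklearn.base import clone

proba_sum = np.zeros(len(X))
counts = np.zeros(len(X))
for train, test in cv.split(X):
    model = clone(pipe).fit(X.iloc[train], y.iloc[train].values.ravel())
    proba_sum[test] += model.predict_proba(X.iloc[test])[:, 1]
    counts[test] += 1

mean_proba = np.where(counts > 0, proba_sum / np.maximum(counts, 1), np.nan)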

Documentation

Documentation and examples do not address the splitting of the data set into training and test sets.

If using one of the cross-validators, does the data set need to be sorted in time order? Is there a way to designate a datetime column so the class understands on what basis to split the data sequentially?
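
As far as I can tell, the splitters work purely on row position, so the data does need to be sorted in time order first, and there is currently no parameter for designating a datetime column. A minimal sketch of that preparatory step (the column names are hypothetical):

# Hypothetical raw data whose rows are not yet in chronological order.
import numpy as np
import pandas as pd
from tscv import GapKFold

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=20, freq="D")[np.random.permutation(20)],
    "f1": np.random.randn(20),
    "target": np.random.randint(0, 2, 20),
})

# Sort so that positional order equals chronological order, then split as usual.
df = df.sort_values("timestamp").reset_index(drop=True)
X, y = df[["f1"]], df["target"]
cv = GapKFold(n_splits=5, gap_before=2, gap_after=2)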

Intuition on setting the number of gaps

If, for example, I have data without gaps, when and why would I still create a break between my training and validation sets? I have seen the argument for setting gaps when the period to be predicted is N days after the training period. Are there other reasons? And if so, what is the intuition for deciding how many gaps to include before/after the training set?
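
Not from the package documentation, but one common heuristic: make the gap on each side of the test block at least as large as the longest look-back or look-ahead window your features and labels use, so that no training row shares information with the test block. The numbers below are purely illustrative:

# Illustrative assumption: features use a 5-step rolling window and the label
# looks a few steps ahead, so a 5-step gap on each side of the test block
# covers the overlap in both directions.
from tscv import GapKFold

cv = GapKFold(n_splits=5, gap_before=5, gap_after=5)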

Revamped version of GapWalkForward: GapRollForward

The current implementation is based on legacy K-Fold cross-validation and requires an explicit value for the n_splits parameter, which puts the burden of calculating the desired value of n_splits on the user.

A better implementation should allow the user to instantiate a GapWalkForward class without specifying a value for n_splits; instead, it can deduce the right value from the other inputs.

It is theoretically desirable to keep both channels of kick-starting a GapWalkForward class. In practice, however, it is hard to maintain both within a single class. Therefore, I have decided to deprecate the n_splits channel and implement a new class dubbed GapRollForward in v0.1.0 -- the version after the next.

Improve the user experience of `gap_train_test_split`

  • (Optional) better error message when invalid/misspelt keyword parameters are given
  • Use check_consistent_length to ensure that all arguments have the same number of rows
  • Verify that the gap_size argument is a real number
  • Improve the warning message when both train_size and test_size arguments are given:

The train_size argument is overridden by test_size; in case of nonzero 'gap_size', an explicit value should be provided and cannot be implied by '1 - train_size - test_size'.
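
For context, a minimal usage sketch of gap_train_test_split as I understand the current API (the keyword names follow my reading of the README and should be double-checked):

# Sketch: hold out the last 20% of rows, leaving a 2-sample gap between
# the training and test blocks.
import numpy as np
from tscv import gap_train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

X_train, X_test, y_train, y_test = gap_train_test_split(
    X, y, test_size=0.2, gap_size=2
)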

Make it work with cross_val_predict

Is it possible to somehow make the CV work with the cross_val_predict function? For example, if I try:

cv = GapWalkForward(n_splits=3, gap_size=1, test_size=2)
cross_val_predict(estimator=SGDClassifier(), X=X_sample, y=y_bin_sample, cv=cv, n_jobs=6)

it returns an error

ValueError: cross_val_predict only works for partitions

but I would like to have the predictions so I can build a confusion matrix and other statistics.

Is it possible to make it work with your cross-validators?
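
Since cross_val_predict only accepts CV schemes whose test folds form a partition of the rows, one workaround (a sketch, not a tscv feature) is to collect the test-fold predictions manually and build the confusion matrix from those rows only, reusing X_sample and y_bin_sample from the question (assumed here to be NumPy arrays; use .iloc for DataFrames):

# Sketch: manual out-of-fold predictions for a gapped walk-forward split.
# Only rows that ever appear in a test fold are evaluated.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from tscv import GapWalkForward

cv = GapWalkForward(n_splits=3, gap_size=1, test_size=2)

y_true, y_pred = [], []
for train, test in cv.split(X_sample):
    model = SGDClassifier().fit(X_sample[train], y_bin_sample[train])
    y_true.append(y_bin_sample[test])
    y_pred.append(model.predict(X_sample[test]))

print(confusion_matrix(np.concatenate(y_true), np.concatenate(y_pred)))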

Release 0.0.4 for GridSearch compat

Would it be possible to issue a new release on PyPI to include the latest changes from this commit which aligns the get_n_splits method signature with the abstract method signature required by GridSearchCV?

[Docs] Use this package for Nested Cross-Validation

This issue documents the way to use this package for nested cross-validation. If you have any questions, feel free to comment below.

Flat cross-validation vs. nested cross-validation

To clarify the meaning of these two terms in this specific issue, let me first describe them.

Flat cross-validation

Let us use 5-Fold as an example. In a 5-Fold flat cross-validation, you split the dataset into 5 subsets. Each time, you train a model on 4 of them and test it on the remaining one. Afterwards, you average the 5 scores yielded by the 5 test subsets.

ooooo: training subset
*****: test subset

ooooo ooooo ooooo ooooo *****
ooooo ooooo ooooo ***** ooooo
ooooo ooooo ***** ooooo ooooo
ooooo ***** ooooo ooooo ooooo
***** ooooo ooooo ooooo ooooo

Reasonably, the model you train depends on both the algorithm you use and the hyperparameters you input. Therefore, the averaged score provides a criterion for evaluating both the algorithms and the hyperparameters. I will later explain whether these evaluations are accurate enough, but for now it suffices to understand the basic procedure.
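
As a concrete illustration of the flat procedure with a gapped splitter, each candidate hyperparameter can be scored directly with cross_val_score (a sketch with an arbitrary estimator and made-up data):

# Flat CV sketch: each candidate hyperparameter gets one averaged score.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from tscv import GapKFold

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = rng.standard_normal(100)

cv = GapKFold(n_splits=5, gap_before=2, gap_after=2)
for alpha in (0.1, 1.0, 10.0):
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=cv)
    print(alpha, scores.mean())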

Nested cross-validation

In contrast to flat cross-validation, which evaluates both the algorithms and the hyperparameters in one fell swoop, nested cross-validation evaluates them in a hierarchical fashion. In the upper level, it evaluates the algorithms; in the lower level, it evaluates the hyperparameters within each algorithm.

Let us still use the 5-Fold setup. First we, likewise, split the dataset into 5 subsets. Let us call this the macro split, which allows us to run the same experiment 5 times. In each run, we further split the training set into 5 sub-subsets. Let us call this the micro split. If the whole dataset has 25 samples, then the macro split sets 20 samples for training and 5 samples for testing in each run, and the micro split further splits the 20 training samples into 16 for training and 4 for testing.

Macro split:

12345 12345 12345 12345 *****   =>  further split to micro split -- No. 1
12345 12345 12345 ***** 12345   =>  further split to micro split -- No. 2
12345 12345 ***** 12345 12345   =>  further split to micro split -- No. 3
12345 ***** 12345 12345 12345   =>  further split to micro split -- No. 4
***** 12345 12345 12345 12345   =>  further split to micro split -- No. 5

(Indicative) micro split -- No. 1 (5 in total):

1111 2222 3333 4444 xxxx
1111 2222 3333 xxxx 5555
1111 2222 xxxx 4444 5555
1111 xxxx 3333 4444 5555
xxxx 2222 3333 4444 5555

In the upper-level macro split, we choose a target algorithm and dive into the lower-level micro split. With the target algorithm fixed, we vary the hyperparameters to get an evaluation for each setting and choose the optimal one. Then, we return to the upper level, fix the hyperparameters at that optimum, and evaluate the target algorithm. Finally, we choose another target algorithm and repeat the same procedure.

Let us call this 5x5 nested cross-validation. Of course, you can in general use an mxn nested cross-validation. The essence is to separate the evaluation of the algorithm from the evaluation of the hyperparameters.
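
In scikit-learn terms, this hierarchy is usually expressed as a grid search on the micro split wrapped inside an outer scoring loop on the macro split; a generic sketch (plain K-Fold here, since gaps come up in the next section):

# Nested CV sketch: the inner loop tunes hyperparameters, the outer loop
# scores the tuned algorithm on held-out macro folds.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = rng.standard_normal(100)

inner_cv = KFold(n_splits=5)   # micro split
outer_cv = KFold(n_splits=5)   # macro split

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=inner_cv)
algorithm_score = cross_val_score(search, X, y, cv=outer_cv).mean()
print(algorithm_score)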

Use nested cross-validation for time series

In time series cross-validation, you need to introduce gaps, which makes the problem tricky. Luckily, we have an easy workaround: the 2xn nested cross-validation comes for free:

2x4 nested cross-validation
---------------------------------

Macro split:

ooooo ooooo ooooo ooooo gap *****
***** gap ooooo ooooo ooooo ooooo

Micro split -- No. 1 (2 in total):

oooo oooo oooo gap ****
ooo ooo gap **** gap ooo
ooo gap **** gap ooo ooo
**** gap oooo oooo oooo

You can use my package tscv for this kind of 2xn nested cross-validation.
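
Here is one way the 2xn scheme above could be wired up with tscv and scikit-learn (my own arrangement, not an official recipe): the two macro runs are built by hand from index slices, and each macro training block is tuned with a gapped K-Fold micro split:

# 2xn nested CV sketch: two macro runs (test block at the end, then at the
# start), each tuned with a gapped K-Fold micro split via GridSearchCV.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from tscv import GapKFold

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = rng.standard_normal(100)

gap, test_len = 2, 20
macro_runs = [
    (np.arange(0, len(X) - test_len - gap), np.arange(len(X) - test_len, len(X))),  # test at the end
    (np.arange(test_len + gap, len(X)), np.arange(0, test_len)),                    # test at the start
]

micro_cv = GapKFold(n_splits=4, gap_before=gap, gap_after=gap)
for train_idx, test_idx in macro_runs:
    search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=micro_cv)
    search.fit(X[train_idx], y[train_idx])
    print(search.best_params_, search.score(X[test_idx], y[test_idx]))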

Why nested cross-validation?

The reason is that algorithms with more hyperparameters have an edge in flat cross-validation. The dimension of the hyperparameter space can be seen as the algorithm's capacity for "bribery": the more hyperparameters the algorithm owns, the more severely the algorithm can compromise the test dataset. Flat cross-validation, by nature, favours algorithms with rich hyperparameters. In contrast, nested cross-validation puts every algorithm on the same starting line. That is why nested cross-validation is preferred when comparing algorithms with significantly different dimensions of hyperparameters.

Then, does nested cross-validation provide an accurate way to evaluate the final chosen model? No; although it helps you pick the best algorithm and its hyperparameters, the resulting model's performance is not what is being measured. To explain this, we need some advanced statistics, so to avoid bloating this issue I will only mention here that model(x*) is different from model(x)|x=x*. The good news, however, is that if your algorithm does not have too many hyperparameters, the cross-validation error will not be too far away from the resulting model's error. Therefore, an algorithm with better performance in nested cross-validation likely leads to a model with better performance in terms of generalization error.

Implement Rep-Holdout

Thank you for this repository and the implemented CV methods, especially GapRollForward. I was looking for exactly this package.

I was wondering if you are interested in implementing another CV method for time series, called Rep-Holdout. It is used in this evaluation paper (https://arxiv.org/abs/1905.11744) and has good performance compared to all other CV methods, some of which you have implemented here.

As I understand it, it is somewhat like sklearn.model_selection.TimeSeriesSplit but with a randomized selection of all possible folds. Here is the description from the paper as an image:

[Image: Rep-Holdout procedure as described in the paper]

The authors provided code in R, but it is written very differently from how it needs to look in Python. I adapted your functions to implement it in Python, but I am not the best coder and it really only serves my purpose of tuning a specific model. Seeing as the performance of Rep-Holdout is good and - to me at least - it makes sense for time series cross-validation, maybe you are interested in adding this function to your package?
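
For reference, my reading of the paper's description (not the authors' R code): Rep-Holdout repeats a plain holdout split at randomly drawn cut points, keeping a fixed-size training block before each cut and a test block after it. A rough sketch of such a splitter:

# Rep-Holdout sketch (my interpretation of the paper, not a tscv API):
# repeat a train/test holdout at n_reps randomly drawn cut points.
import numpy as np

def rep_holdout_split(n_samples, n_reps=10, train_size=50, test_size=10, seed=0):
    rng = np.random.default_rng(seed)
    # Cut points must leave room for the training block before and the test block after.
    lo, hi = train_size, n_samples - test_size
    for cut in rng.integers(lo, hi, size=n_reps):
        train = np.arange(cut - train_size, cut)
        test = np.arange(cut, cut + test_size)
        yield train, test

# Usage: behaves like an iterable of (train, test) index pairs, so it can be
# passed to scikit-learn via cv=list(rep_holdout_split(len(X), ...)).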

split.py depends on deprecated / newly private method `_safe_indexing` in scikit-learn 0.24.0

Just flagging a minor issue:

We found this after running poetry update on our dependencies, inadvertently bumping scikit-learn to 0.24.0. This broke code we have that uses tscv.

Relevant scikit-learn source code from version 0.23.0:
https://github.com/scikit-learn/scikit-learn/blob/0.23.0/sklearn/utils/__init__.py#L274-L275

The method has been made private in scikit-learn 0.24.0: https://github.com/scikit-learn/scikit-learn/blob/0.24.0/sklearn/utils/__init__.py#L271

I did not investigate further; we pinned scikit-learn to 0.23.0, and that's OK for now, but some refactoring may be in order to move off the private method.
