klarna-incubator / mleko

Simplify and accelerate your machine learning development with mleko. Designed with modularity and customization in mind, it seamlessly integrates into your existing workflows. Its robust caching system optimizes performance, taking you from data ingestion to finalized models with unparalleled efficiency.

Home Page: https://klarna-incubator.github.io/mleko/

License: Apache License 2.0

Python 99.83% Jinja 0.17%
artificial-intelligence data-science machine-learning pipeline python vaex

mleko's People

Contributors

dependabot[bot], erikbavenstrand, ksyula

Forkers

ksyula

mleko's Issues

[Bug] Reserved Keywords as Column Names Crashes Vaex

[Bug] Vaex expressions crash when columns are named using Python's reserved keywords

Describe the bug
When a column is named after one of Python's reserved keywords, some Vaex expressions that reference it crash. Affected keywords include:

  • True
  • elif
  • in
  • try
  • and
  • else
  • is
  • while
  • as
  • except
  • lambda
  • with
  • assert
  • finally
  • nonlocal
  • yield
  • break
  • for
  • not
  • class
  • from
  • or
  • continue
  • global
  • pass

Expected behavior
The library should append some suffix or prefix to the column names that are Python reserved keywords to prevent such crashes.
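The renaming described above could look roughly like the following. This is a minimal sketch, not mleko's implementation: the function name `make_keyword_safe` and the `_` suffix are assumptions chosen for illustration, and only the standard-library `keyword` module is used to detect reserved words.

```python
import keyword


def make_keyword_safe(columns, suffix="_"):
    """Map each column name to a name that is not a Python reserved keyword.

    Keywords get `suffix` appended; the suffix is repeated until the new
    name no longer collides with any other column name.
    """
    taken = set(columns)
    renames = {}
    for name in columns:
        if not keyword.iskeyword(name):
            renames[name] = name
            continue
        candidate = name + suffix
        while candidate in taken:
            candidate += suffix
        taken.add(candidate)
        renames[name] = candidate
    return renames


columns = ["amount", "class", "from", "class_"]
safe = make_keyword_safe(columns)
```

Returning a mapping rather than a renamed list keeps the original names available, so they could be restored when results are written back out.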

[Feature] Interactive Variable Analysis Report

Is your feature request related to a problem? Please describe.
The mleko library currently lacks a comprehensive reporting functionality for the Exploratory Data Analysis (EDA) step. This makes it difficult for users to understand and interpret the results of their EDA.

Describe the solution you'd like
I propose adding a feature to the mleko library that generates a comprehensive report at the end of the EDA process, similar to the ydata-profiling library. This report could include various statistical measures, correlations, missing values, and visualizations.

Describe alternatives you've considered
An alternative solution would be to integrate mleko with an existing data profiling library such as ydata-profiling (formerly pandas-profiling) or sweetviz. This would allow users to generate detailed EDA reports without having to write additional code.

Additional context
A comprehensive EDA report can provide valuable insights into the data and help guide the subsequent steps of the machine learning pipeline. It would be a useful addition to the mleko library.

[Feature] Make `cache_directory` Parameter Optional

Is your feature request related to a problem? Please describe.
Currently, the cache_directory parameter is mandatory for all components, even though users do not always need to control where the cache is stored.

Describe the solution you'd like
Make the cache_directory parameter optional. This feature request relies on the completion of #102.

Additional context
Making the cache_directory parameter optional would simplify the user interface and make the library easier to use.

[Feature] Add Sampling Using ImbLearn

Is your feature request related to a problem? Please describe.
Working with very unbalanced datasets can lead to poor modeling results. This issue builds on ticket #150.

Describe the solution you'd like
I would like to add a feature that allows for sampling rows in very unbalanced datasets using ImbLearn to improve modeling.
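To make the idea concrete, here is a plain-Python sketch of random undersampling, the simplest of the resampling strategies that ImbLearn offers (there as `imblearn.under_sampling.RandomUnderSampler`). The function name, the list-of-dicts row format, and the `seed` parameter are assumptions for illustration only; a real integration would operate on the library's own data frames.

```python
import random
from collections import defaultdict


def random_undersample(rows, label_key, seed=0):
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    # Every class is reduced to the size of the rarest one.
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# 95 negative rows and 5 positive rows: a heavily unbalanced toy dataset.
rows = [{"x": i, "y": 0} for i in range(95)] + [{"x": i, "y": 1} for i in range(5)]
balanced = random_undersample(rows, "y")
```

Undersampling discards majority-class rows, so it is usually applied to the training split only, leaving validation and test data untouched.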

Describe alternatives you've considered
An alternative could be to manually balance the dataset, but this would be time-consuming and may not yield the best results.

Additional context
This feature would make the mleko library more robust and versatile, especially for users dealing with unbalanced datasets.

[Feature] Switch `LGBMModel` to use the Scikit-learn API

Is your feature request related to a problem? Please describe.
The default argument configuration in the current version of LGBMModel is faulty because it overwrites aliases. This makes the class unpredictable and unsuitable for use. Any future arguments added to LightGBM would also require manual changes in mleko, which is not sustainable.

Describe the solution you'd like
Switch to LightGBM's Scikit-learn API and pass an estimator instance to LGBMModel instead. The .get_params() method available on any BaseEstimator allows the hyperparameter configuration to be cached without tracking it manually inside dictionaries.
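As a rough illustration of how `.get_params()` could feed a cache key, the sketch below hashes the returned dictionary. It is an assumption-laden example, not mleko's caching code: `fingerprint_params` and `StandInEstimator` are hypothetical names, and the stand-in class only mimics the Scikit-learn `get_params()` contract so the example runs without LightGBM installed. An estimator such as `lightgbm.LGBMClassifier` could be passed in the same way.

```python
import hashlib
import json


def fingerprint_params(estimator):
    """Hash an estimator's hyperparameters into a stable cache key."""
    params = estimator.get_params()
    # sort_keys makes the key independent of dict ordering; default=str is
    # this sketch's fallback for values that are not JSON-serializable.
    payload = json.dumps(params, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class StandInEstimator:
    """Hypothetical stand-in exposing the Scikit-learn get_params() contract."""

    def __init__(self, learning_rate=0.1, num_leaves=31):
        self.learning_rate = learning_rate
        self.num_leaves = num_leaves

    def get_params(self, deep=True):
        return {"learning_rate": self.learning_rate, "num_leaves": self.num_leaves}


key_a = fingerprint_params(StandInEstimator(learning_rate=0.1))
key_b = fingerprint_params(StandInEstimator(learning_rate=0.1))
key_c = fingerprint_params(StandInEstimator(learning_rate=0.05))
```

Identical configurations produce identical keys, while any changed hyperparameter yields a different key and would therefore invalidate the cached model.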

[Feature] Bayesian Hyperparameter Tuning

Is your feature request related to a problem? Please describe.
Currently, the mleko library supports only grid search and random search for hyperparameter tuning. However, these methods can be time-consuming and inefficient for large hyperparameter spaces.

Describe the solution you'd like
I would like the mleko library to support Bayesian optimization for hyperparameter tuning. This could be implemented using a Python library like Scikit-Optimize or Hyperopt.

Describe alternatives you've considered
Alternatively, mleko could provide a general interface for hyperparameter tuning that allows users to plug in their own optimization algorithms. This would make the library more flexible and adaptable to different use cases.

Additional context
Bayesian optimization is a more efficient method for hyperparameter tuning that can significantly reduce the time required to find the best hyperparameters. It would be a valuable addition to the mleko library.

[Feature] Add Correlation Based Feature Selection (CFS Algorithm)

Is your feature request related to a problem? Please describe.
While correlation feature selection already exists in the mleko library, it does not implement the specific algorithm outlined in this thesis: https://www.cs.waikato.ac.nz/~mhall/thesis.pdf

Describe the solution you'd like
I would like to add a feature that implements the specific correlation based feature selection algorithm described in the provided thesis.

Describe alternatives you've considered
An alternative could be to use the existing correlation feature selection, but it may not provide the same results or benefits as the specific algorithm from the thesis.

Additional context
Implementing this specific algorithm would enhance the feature selection capabilities of the mleko library, potentially leading to better modeling results.

[BUG] Documentation build failed

Describe the bug
The readthedocs build is failing with the following error message:

Running Sphinx v7.2.5
making output directory... done
myst v2.0.0: MdParserConfig(commonmark_only=False, gfm_only=False, enable_extensions=set(), disable_syntax=[], all_links_external=False, url_schemes=('http', 'https', 'mailto', 'ftp'), ref_domains=None, fence_as_directive=set(), number_code_blocks=[], title_to_header=False, heading_anchors=0, heading_slug_func=None, html_meta={}, footnote_transition=True, words_per_minute=200, substitutions={}, linkify_fuzzy_links=True, dmath_allow_labels=True, dmath_allow_space=True, dmath_allow_digits=True, dmath_double_inline=False, update_mathjax=True, mathjax_classes='tex2jax_process|mathjax_process|math|output_area', enable_checkboxes=False, suppress_warnings=[], highlight_code_blocks=True)
/home/docs/checkouts/readthedocs.org/user_builds/mleko/envs/stable/lib/python3.10/site-packages/autoapi/mappers/python/mapper.py:294: RemovedInSphinx80Warning: The alias 'sphinx.util.status_iterator' is deprecated, use 'sphinx.util.display.status_iterator' instead. Check CHANGES for Sphinx API modifications.
  for dir_root, path in sphinx.util.status_iterator(
[AutoAPI] Reading files... [  2%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/__init__.py
[AutoAPI] Reading files... [  3%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/pipeline_step.py
[AutoAPI] Reading files... [  5%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/pipeline.py
[AutoAPI] Reading files... [  7%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/__init__.py
[AutoAPI] Reading files... [  8%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/data_container.py
[AutoAPI] Reading files... [ 10%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/steps/convert_step.py
[AutoAPI] Reading files... [ 12%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/steps/model_step.py
[AutoAPI] Reading files... [ 13%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/steps/ingest_step.py
[AutoAPI] Reading files... [ 15%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/steps/split_step.py
[AutoAPI] Reading files... [ 17%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/steps/__init__.py
[AutoAPI] Reading files... [ 18%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/steps/feature_select_step.py
[AutoAPI] Reading files... [ 20%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/steps/transform_step.py
[AutoAPI] Reading files... [ 22%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/cache_mixin.py
[AutoAPI] Reading files... [ 23%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/lru_cache_mixin.py
[AutoAPI] Reading files... [ 25%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/__init__.py
[AutoAPI] Reading files... [ 27%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/handlers/vaex_cache_handler.py
[AutoAPI] Reading files... [ 28%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/handlers/pickle_cache_handler.py
[AutoAPI] Reading files... [ 30%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/handlers/joblib_cache_handler.py
[AutoAPI] Reading files... [ 32%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/handlers/__init__.py
[AutoAPI] Reading files... [ 33%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/handlers/base_cache_handler.py
[AutoAPI] Reading files... [ 35%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/fingerprinters/base_fingerprinter.py
[AutoAPI] Reading files... [ 37%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/fingerprinters/vaex_fingerprinter.py
[AutoAPI] Reading files... [ 38%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/fingerprinters/csv_fingerprinter.py
[AutoAPI] Reading files... [ 40%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/fingerprinters/__init__.py
[AutoAPI] Reading files... [ 42%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/utils/decorators.py
[AutoAPI] Reading files... [ 43%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/utils/vaex_helpers.py
[AutoAPI] Reading files... [ 45%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/utils/custom_logger.py
[AutoAPI] Reading files... [ 47%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/utils/tqdm_helpers.py
[AutoAPI] Reading files... [ 48%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/utils/__init__.py
[AutoAPI] Reading files... [ 50%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/utils/file_helpers.py
[AutoAPI] Reading files... [ 52%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/__init__.py
[AutoAPI] Reading files... [ 53%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/data_schema.py
[AutoAPI] Reading files... [ 55%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/ingest/base_ingester.py
[AutoAPI] Reading files... [ 57%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/ingest/kaggle_ingester.py
[AutoAPI] Reading files... [ 58%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/ingest/s3_ingester.py
[AutoAPI] Reading files... [ 60%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/ingest/__init__.py
[AutoAPI] Reading files... [ 62%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/feature_select/composite_feature_selector.py
[AutoAPI] Reading files... [ 63%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/feature_select/pearson_correlation_feature_selector.py
[AutoAPI] Reading files... [ 65%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/feature_select/variance_feature_selector.py
[AutoAPI] Reading files... [ 67%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/feature_select/invariance_feature_selector.py
[AutoAPI] Reading files... [ 68%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/feature_select/base_feature_selector.py
[AutoAPI] Reading files... [ 70%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/feature_select/__init__.py
[AutoAPI] Reading files... [ 72%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/feature_select/missing_rate_feature_selector.py
[AutoAPI] Reading files... [ 73%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/split/expression_splitter.py
[AutoAPI] Reading files... [ 75%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/split/base_splitter.py
[AutoAPI] Reading files... [ 77%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/split/random_splitter.py
[AutoAPI] Reading files... [ 78%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/split/__init__.py
[AutoAPI] Reading files... [ 80%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/transform/frequency_encoder_transformer.py
[AutoAPI] Reading files... [ 82%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/transform/composite_transformer.py
[AutoAPI] Reading files... [ 83%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/transform/label_encoder_transformer.py
[AutoAPI] Reading files... [ 85%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/transform/min_max_scaler_transformer.py
[AutoAPI] Reading files... [ 87%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/transform/base_transformer.py
[AutoAPI] Reading files... [ 88%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/transform/__init__.py
[AutoAPI] Reading files... [ 90%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/transform/max_abs_scaler_transformer.py
[AutoAPI] Reading files... [ 92%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/convert/csv_to_vaex_converter.py
[AutoAPI] Reading files... [ 93%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/convert/base_converter.py
[AutoAPI] Reading files... [ 95%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/convert/__init__.py
[AutoAPI] Reading files... [ 97%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/model/base_model.py
[AutoAPI] Reading files... [ 98%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/model/lgbm_model.py
[AutoAPI] Reading files... [100%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/model/__init__.py
/home/docs/checkouts/readthedocs.org/user_builds/mleko/envs/stable/lib/python3.10/site-packages/autoapi/mappers/base.py:294: RemovedInSphinx80Warning: The alias 'sphinx.util.status_iterator' is deprecated, use 'sphinx.util.display.status_iterator' instead. Check CHANGES for Sphinx API modifications.
  for _, data in sphinx.util.status_iterator(
[AutoAPI] Mapping Data... [  2%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/__init__.py
[AutoAPI] Mapping Data... [  3%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/pipeline_step.py
[AutoAPI] Mapping Data... [  5%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/pipeline.py
[AutoAPI] Mapping Data... [  7%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/__init__.py
[AutoAPI] Mapping Data... [  8%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/data_container.py
[AutoAPI] Mapping Data... [ 10%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/steps/convert_step.py
[AutoAPI] Mapping Data... [ 12%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/steps/model_step.py
[AutoAPI] Mapping Data... [ 13%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/steps/ingest_step.py
[AutoAPI] Mapping Data... [ 15%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/steps/split_step.py
[AutoAPI] Mapping Data... [ 17%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/steps/__init__.py
[AutoAPI] Mapping Data... [ 18%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/steps/feature_select_step.py
[AutoAPI] Mapping Data... [ 20%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/pipeline/steps/transform_step.py
[AutoAPI] Mapping Data... [ 22%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/cache_mixin.py
[AutoAPI] Mapping Data... [ 23%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/lru_cache_mixin.py
[AutoAPI] Mapping Data... [ 25%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/__init__.py
[AutoAPI] Mapping Data... [ 27%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/handlers/vaex_cache_handler.py
[AutoAPI] Mapping Data... [ 28%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/handlers/pickle_cache_handler.py
[AutoAPI] Mapping Data... [ 30%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/handlers/joblib_cache_handler.py
[AutoAPI] Mapping Data... [ 32%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/handlers/__init__.py
[AutoAPI] Mapping Data... [ 33%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/handlers/base_cache_handler.py
[AutoAPI] Mapping Data... [ 35%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/fingerprinters/base_fingerprinter.py
[AutoAPI] Mapping Data... [ 37%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/fingerprinters/vaex_fingerprinter.py
[AutoAPI] Mapping Data... [ 38%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/fingerprinters/csv_fingerprinter.py
[AutoAPI] Mapping Data... [ 40%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/cache/fingerprinters/__init__.py
[AutoAPI] Mapping Data... [ 42%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/utils/decorators.py
[AutoAPI] Mapping Data... [ 43%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/utils/vaex_helpers.py
[AutoAPI] Mapping Data... [ 45%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/utils/custom_logger.py
[AutoAPI] Mapping Data... [ 47%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/utils/tqdm_helpers.py
[AutoAPI] Mapping Data... [ 48%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/utils/__init__.py
[AutoAPI] Mapping Data... [ 50%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/utils/file_helpers.py
[AutoAPI] Mapping Data... [ 52%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/__init__.py
[AutoAPI] Mapping Data... [ 53%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/data_schema.py
[AutoAPI] Mapping Data... [ 55%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/ingest/base_ingester.py
[AutoAPI] Mapping Data... [ 57%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/ingest/kaggle_ingester.py
[AutoAPI] Mapping Data... [ 58%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/ingest/s3_ingester.py
[AutoAPI] Mapping Data... [ 60%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/ingest/__init__.py
[AutoAPI] Mapping Data... [ 62%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/feature_select/composite_feature_selector.py
[AutoAPI] Mapping Data... [ 63%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/feature_select/pearson_correlation_feature_selector.py
[AutoAPI] Mapping Data... [ 65%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/feature_select/variance_feature_selector.py
[AutoAPI] Mapping Data... [ 67%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/feature_select/invariance_feature_selector.py
[AutoAPI] Mapping Data... [ 68%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/feature_select/base_feature_selector.py
[AutoAPI] Mapping Data... [ 70%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/feature_select/__init__.py
[AutoAPI] Mapping Data... [ 72%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/feature_select/missing_rate_feature_selector.py
[AutoAPI] Mapping Data... [ 73%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/split/expression_splitter.py
[AutoAPI] Mapping Data... [ 75%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/split/base_splitter.py
[AutoAPI] Mapping Data... [ 77%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/split/random_splitter.py
[AutoAPI] Mapping Data... [ 78%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/split/__init__.py
[AutoAPI] Mapping Data... [ 80%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/transform/frequency_encoder_transformer.py
[AutoAPI] Mapping Data... [ 82%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/transform/composite_transformer.py
[AutoAPI] Mapping Data... [ 83%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/transform/label_encoder_transformer.py
[AutoAPI] Mapping Data... [ 85%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/transform/min_max_scaler_transformer.py
[AutoAPI] Mapping Data... [ 87%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/transform/base_transformer.py
[AutoAPI] Mapping Data... [ 88%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/transform/__init__.py
[AutoAPI] Mapping Data... [ 90%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/transform/max_abs_scaler_transformer.py
[AutoAPI] Mapping Data... [ 92%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/convert/csv_to_vaex_converter.py
[AutoAPI] Mapping Data... [ 93%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/convert/base_converter.py
[AutoAPI] Mapping Data... [ 95%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/dataset/convert/__init__.py
[AutoAPI] Mapping Data... [ 97%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/model/base_model.py
[AutoAPI] Mapping Data... [ 98%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/model/lgbm_model.py
[AutoAPI] Mapping Data... [100%] /home/docs/checkouts/readthedocs.org/user_builds/mleko/checkouts/stable/mleko/model/__init__.py
/home/docs/checkouts/readthedocs.org/user_builds/mleko/envs/stable/lib/python3.10/site-packages/autoapi/mappers/base.py:312: RemovedInSphinx80Warning: The alias 'sphinx.util.status_iterator' is deprecated, use 'sphinx.util.display.status_iterator' instead. Check CHANGES for Sphinx API modifications.
  for _, obj in sphinx.util.status_iterator(
[AutoAPI] Rendering Data... [  2%] mleko
[AutoAPI] Rendering Data... [  3%] mleko.pipeline.pipeline_step
[AutoAPI] Rendering Data... [  5%] mleko.pipeline.pipeline
[AutoAPI] Rendering Data... [  7%] mleko.pipeline
[AutoAPI] Rendering Data... [  8%] mleko.pipeline.data_container
[AutoAPI] Rendering Data... [ 10%] mleko.pipeline.steps.convert_step
[AutoAPI] Rendering Data... [ 12%] mleko.pipeline.steps.model_step
[AutoAPI] Rendering Data... [ 13%] mleko.pipeline.steps.ingest_step
[AutoAPI] Rendering Data... [ 15%] mleko.pipeline.steps.split_step
[AutoAPI] Rendering Data... [ 17%] mleko.pipeline.steps
[AutoAPI] Rendering Data... [ 18%] mleko.pipeline.steps.feature_select_step
[AutoAPI] Rendering Data... [ 20%] mleko.pipeline.steps.transform_step
[AutoAPI] Rendering Data... [ 22%] mleko.cache.cache_mixin
[AutoAPI] Rendering Data... [ 23%] mleko.cache.lru_cache_mixin
[AutoAPI] Rendering Data... [ 25%] mleko.cache
[AutoAPI] Rendering Data... [ 27%] mleko.cache.handlers.vaex_cache_handler
[AutoAPI] Rendering Data... [ 28%] mleko.cache.handlers.pickle_cache_handler
[AutoAPI] Rendering Data... [ 30%] mleko.cache.handlers.joblib_cache_handler
[AutoAPI] Rendering Data... [ 32%] mleko.cache.handlers
[AutoAPI] Rendering Data... [ 33%] mleko.cache.handlers.base_cache_handler
[AutoAPI] Rendering Data... [ 35%] mleko.cache.fingerprinters.base_fingerprinter
[AutoAPI] Rendering Data... [ 37%] mleko.cache.fingerprinters.vaex_fingerprinter
[AutoAPI] Rendering Data... [ 38%] mleko.cache.fingerprinters.csv_fingerprinter
[AutoAPI] Rendering Data... [ 40%] mleko.cache.fingerprinters
[AutoAPI] Rendering Data... [ 42%] mleko.utils.decorators
[AutoAPI] Rendering Data... [ 43%] mleko.utils.vaex_helpers
[AutoAPI] Rendering Data... [ 45%] mleko.utils.custom_logger
[AutoAPI] Rendering Data... [ 47%] mleko.utils.tqdm_helpers
[AutoAPI] Rendering Data... [ 48%] mleko.utils
[AutoAPI] Rendering Data... [ 50%] mleko.utils.file_helpers
[AutoAPI] Rendering Data... [ 52%] mleko.dataset
[AutoAPI] Rendering Data... [ 53%] mleko.dataset.data_schema
[AutoAPI] Rendering Data... [ 55%] mleko.dataset.ingest.base_ingester
[AutoAPI] Rendering Data... [ 57%] mleko.dataset.ingest.kaggle_ingester
[AutoAPI] Rendering Data... [ 58%] mleko.dataset.ingest.s3_ingester
[AutoAPI] Rendering Data... [ 60%] mleko.dataset.ingest
[AutoAPI] Rendering Data... [ 62%] mleko.dataset.feature_select.composite_feature_selector
[AutoAPI] Rendering Data... [ 63%] mleko.dataset.feature_select.pearson_correlation_feature_selector
[AutoAPI] Rendering Data... [ 65%] mleko.dataset.feature_select.variance_feature_selector
[AutoAPI] Rendering Data... [ 67%] mleko.dataset.feature_select.invariance_feature_selector
[AutoAPI] Rendering Data... [ 68%] mleko.dataset.feature_select.base_feature_selector
[AutoAPI] Rendering Data... [ 70%] mleko.dataset.feature_select
[AutoAPI] Rendering Data... [ 72%] mleko.dataset.feature_select.missing_rate_feature_selector
[AutoAPI] Rendering Data... [ 73%] mleko.dataset.split.expression_splitter
[AutoAPI] Rendering Data... [ 75%] mleko.dataset.split.base_splitter
[AutoAPI] Rendering Data... [ 77%] mleko.dataset.split.random_splitter
[AutoAPI] Rendering Data... [ 78%] mleko.dataset.split
[AutoAPI] Rendering Data... [ 80%] mleko.dataset.transform.frequency_encoder_transformer
[AutoAPI] Rendering Data... [ 82%] mleko.dataset.transform.composite_transformer
[AutoAPI] Rendering Data... [ 83%] mleko.dataset.transform.label_encoder_transformer
[AutoAPI] Rendering Data... [ 85%] mleko.dataset.transform.min_max_scaler_transformer
[AutoAPI] Rendering Data... [ 87%] mleko.dataset.transform.base_transformer
[AutoAPI] Rendering Data... [ 88%] mleko.dataset.transform
[AutoAPI] Rendering Data... [ 90%] mleko.dataset.transform.max_abs_scaler_transformer
[AutoAPI] Rendering Data... [ 92%] mleko.dataset.convert.csv_to_vaex_converter
[AutoAPI] Rendering Data... [ 93%] mleko.dataset.convert.base_converter
[AutoAPI] Rendering Data... [ 95%] mleko.dataset.convert
[AutoAPI] Rendering Data... [ 97%] mleko.model.base_model
[AutoAPI] Rendering Data... [ 98%] mleko.model.lgbm_model
[AutoAPI] Rendering Data... [100%] mleko.model

[autosummary] generating autosummary for: contributing.md, index.md, license.md, usage.md

Traceback (most recent call last):
  File "/home/docs/checkouts/readthedocs.org/user_builds/mleko/envs/stable/lib/python3.10/site-packages/sphinx/cmd/build.py", line 293, in build_main
    app = Sphinx(args.sourcedir, args.confdir, args.outputdir,
  File "/home/docs/checkouts/readthedocs.org/user_builds/mleko/envs/stable/lib/python3.10/site-packages/sphinx/application.py", line 272, in __init__
    self._init_builder()
  File "/home/docs/checkouts/readthedocs.org/user_builds/mleko/envs/stable/lib/python3.10/site-packages/sphinx/application.py", line 343, in _init_builder
    self.events.emit('builder-inited')
  File "/home/docs/checkouts/readthedocs.org/user_builds/mleko/envs/stable/lib/python3.10/site-packages/sphinx/events.py", line 97, in emit
    results.append(listener.handler(self.app, *args))
  File "/home/docs/checkouts/readthedocs.org/user_builds/mleko/envs/stable/lib/python3.10/site-packages/furo/__init__.py", line 234, in _builder_inited
    raise ConfigError(
sphinx.errors.ConfigError: Furo is being used as an extension in a non-HTML build. This should not happen.

Configuration error:
Furo is being used as an extension in a non-HTML build. This should not happen.

[Feature] README Update

Is your feature request related to a problem? Please describe.
The current README document for our project is not as informative or clear as it could be. The README is often the first point of contact for people interested in our project, so it's crucial that it provides a good first impression. A well-written, informative README can attract more contributors to our project, and can make it easier for everyone to understand the project.

Describe the solution you'd like
I would like the README to be completely rewritten to include detailed information about the project, including its purpose, how it works, how to install and use it, and how to contribute to it. The README should also include a table of contents for easy navigation, as well as links to any relevant resources or documentation.

[Bug] Meta columns not being cast using the forced types

Describe the bug
Currently, meta columns in mleko are not cast to the forced types by the CSVToVaexConverter. This results in incorrect data types for these columns.

Expected behavior
Meta columns should be cast using the forced types before they are assigned as meta columns.

Additional context
This issue may cause inconsistencies in the data and affect the results of the machine learning model.

[Feature] Add `ExpressionTransformer` for creating columns using `vaex` expressions

Is your feature request related to a problem? Please describe.
Currently, there is no built-in functionality to create new columns based on expressions in the mleko library.

Describe the solution you'd like
I would like to add an ExpressionTransformer that takes a dictionary of expressions in this format: dict[str, tuple[str, DataType]]. This transformer would create new columns according to the provided dictionary. The key in the dictionary is the new feature name, and the tuple value is the expression and the column data type.
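The proposed mapping could be sketched roughly as follows. This is only an illustration under stated assumptions: a plain dict of column lists stands in for a Vaex DataFrame, string labels such as "numeric" stand in for mleko's DataType, and the function name add_expression_columns is hypothetical rather than part of mleko.

```python
from typing import Any

# Hypothetical stand-in: column name -> list of values (not a Vaex DataFrame).
Table = dict[str, list[Any]]

# Assumed data type labels; mleko's real DataType values may differ.
ALLOWED_TYPES = {"numeric", "categorical", "boolean", "datetime"}


def add_expression_columns(
    table: Table, expressions: dict[str, tuple[str, str]]
) -> Table:
    """Return a copy of `table` with one new column per expression.

    `expressions` maps new feature name -> (expression, data type label).
    The label is only validated here; no casting is performed.
    """
    n_rows = len(next(iter(table.values()))) if table else 0
    result = {name: list(values) for name, values in table.items()}
    for new_name, (expression, data_type) in expressions.items():
        if data_type not in ALLOWED_TYPES:
            raise ValueError(f"Unknown data type {data_type!r} for {new_name!r}")
        code = compile(expression, "<expression>", "eval")
        # Evaluate row by row with builtins disabled; only the original
        # column names are visible to the expression.
        result[new_name] = [
            eval(code, {"__builtins__": {}}, {c: table[c][i] for c in table})
            for i in range(n_rows)
        ]
    return result


table = {"a": [1, 2, 3], "b": [10, 20, 30]}
out = add_expression_columns(table, {"a_plus_b": ("a + b", "numeric")})
```

Python's eval is used here only to keep the sketch short; a real transformer would presumably hand the expression string to Vaex instead.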

Describe alternatives you've considered
An alternative could be to manually create new columns based on expressions, but having a built-in transformer in the mleko library would make the process more streamlined and consistent.

Additional context
This feature would be a valuable addition to the mleko library, making it easier for users to create new features based on expressions.

[Bug] CSVToVaexConverter crashes when encountering a column with an empty name

Describe the bug
The CSVToVaexConverter crashes when it encounters a column with an empty name ("").

To Reproduce
Steps to reproduce the behavior:

  1. Load a CSV file with a column having an empty name using CSVToVaexConverter
  2. See error

Expected behavior
The CSVToVaexConverter should either automatically fix the issue or throw a more descriptive error message specifically checking for empty column names.
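The "automatically fix" option could look something like the sketch below, which renames empty headers before the data reaches the converter. The function name fill_empty_column_names and the "unnamed_" prefix are assumptions made for illustration, not existing mleko behavior.

```python
def fill_empty_column_names(names: list[str], prefix: str = "unnamed_") -> list[str]:
    """Replace empty or whitespace-only names with unique placeholders."""
    taken = {name for name in names if name.strip()}
    fixed = []
    counter = 0
    for name in names:
        if name.strip():
            fixed.append(name)
            continue
        # Skip placeholder names that collide with existing columns.
        candidate = f"{prefix}{counter}"
        while candidate in taken:
            counter += 1
            candidate = f"{prefix}{counter}"
        taken.add(candidate)
        fixed.append(candidate)
        counter += 1
    return fixed


header = ["id", "", "price", "  "]
fixed_header = fill_empty_column_names(header)
```

The alternative, raising a descriptive error, would replace the renaming branch with a ValueError that names the offending column position.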

Environment (please complete the following information):

  • OS: *
  • Python version: *
  • MLEKO version: 1.1.0

[Feature] Add `Scikit-learn` Decision Tree Support

Is your feature request related to a problem? Please describe.
Currently, there is no general support for scikit-learn in the mleko library.

Describe the solution you'd like
I would like to add a feature that provides general support forscikit-learn, allowing users to easily integratescikit-learn models and functions into their mleko pipelines.

Describe alternatives you've considered
An alternative could be to manually integrate scikit-learn into the mleko pipeline, but this would be less efficient and may lead to inconsistencies.

Additional context
scikit-learn is a widely used machine learning library, and its support would greatly enhance the functionality and versatility of the mleko library.

[BUG] Release Build Fails on No Version Change

Describe the bug
The GitHub Action for release fails when changes are pushed to the repository that do not update the code, leading to no change in the semantic versioning. This results in an unnecessary failure of the GitHub Action, even though the state of the codebase is still valid.

To Reproduce
Steps to reproduce the behavior:

  1. Make a commit that does not change any code (e.g., update documentation, change README, etc.)
  2. Push the commit to the repository.
  3. Observe the GitHub Actions tab and see that the release action fails.

Expected behavior
The GitHub Action for release should not fail when non-code changes are pushed. It should either ignore these changes or handle them gracefully, without causing the entire action to fail.

Additional context
This bug can cause confusion and unnecessary alarm, as it makes it appear as though there is a problem with the codebase when there is not. It also makes the GitHub Actions tab more difficult to navigate, as it will be cluttered with failed actions.

[Feature] Model-Based Feature Selection

Is your feature request related to a problem? Please describe.
Currently, the mleko library does not provide support for model-based feature selection. This is a problem as it leaves users with limited options for feature selection.

Describe the solution you'd like
I would like to see the integration of model-based feature selection methods in mleko. Specifically, the following methods would be beneficial:

  1. Lasso (L1) regularization: This method can help in feature selection by shrinking the coefficients of less important features to zero.

  2. Tree-based feature selection: Decision trees or tree-based ensemble methods (like Random Forests and Gradient Boosting) can be used to rank features based on their importance.

  3. Recursive Feature Elimination (RFE): This is a greedy optimization algorithm which aims to find the best performing feature subset. It repeatedly creates models and keeps aside the best or the worst performing feature at each iteration.

Describe alternatives you've considered
An alternative could be to use other libraries or tools that provide model-based feature selection. However, having this functionality integrated within mleko would make the workflow more efficient.

Additional context
Model-based feature selection methods can provide more accurate and reliable feature selection, which can lead to better model performance. Therefore, adding these methods would greatly enhance the functionality of the mleko library.

[Bug] Null value crashes transform using label encoder even if allow_unseen is set to True

Describe the bug
A null value not seen during fit_transform will crash a subsequent transform using the label encoder, even if allow_unseen is set to True.

To Reproduce
Steps to reproduce the behavior:

  1. Use fit_transform with a dataset that doesn't contain null values
  2. Use transform with a dataset that contains null values

Expected behavior
The label encoder should have an additional parameter that controls whether null values are treated as an independent category or passed through unchanged (an identity transformation of null -> null).
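A minimal sketch of the two behaviors, assuming a custom encoder written from scratch. The class name NullAwareLabelEncoder, the encode_null parameter, and the -1 sentinel for unseen categories are all hypothetical; this is not Vaex's or mleko's label encoder.

```python
from typing import Hashable, Iterable, Optional


class NullAwareLabelEncoder:
    """Hypothetical label encoder with explicit null handling."""

    def __init__(self, encode_null: bool = False, allow_unseen: bool = True) -> None:
        self.encode_null = encode_null
        self.allow_unseen = allow_unseen
        self.mapping: dict[Hashable, int] = {}

    def fit(self, values: Iterable[Hashable]) -> "NullAwareLabelEncoder":
        categories = sorted({v for v in values if v is not None}, key=str)
        self.mapping = {category: code for code, category in enumerate(categories)}
        if self.encode_null:
            # Reserve a code for null even if no null was seen during fit.
            self.mapping[None] = len(self.mapping)
        return self

    def transform(self, values: Iterable[Hashable]) -> list[Optional[int]]:
        encoded: list[Optional[int]] = []
        for value in values:
            if value is None and not self.encode_null:
                encoded.append(None)  # identity: null -> null
            elif value in self.mapping:
                encoded.append(self.mapping[value])
            elif self.allow_unseen:
                encoded.append(-1)  # sentinel for unseen categories
            else:
                raise ValueError(f"Unseen category: {value!r}")
        return encoded


passthrough = NullAwareLabelEncoder(encode_null=False).fit(["a", "b", "a"]).transform(["b", None, "c"])
as_category = NullAwareLabelEncoder(encode_null=True).fit(["a", "b"]).transform([None, "a"])
```

In both modes a null that appears only at transform time no longer depends on what was seen during fit, which is the crash described above.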

Additional context
Possible solutions include creating our own label encoder, fixing Vaex through a PR, or inheriting from the Vaex encoder and fixing it in mleko. I prefer the first option.

[Feature] CatBoost Model

Is your feature request related to a problem? Please describe.
Currently, the mleko library does not support CatBoost, a machine learning algorithm that provides high performance and handles categorical features well.

Describe the solution you'd like
I would like to see CatBoost integrated into the mleko library. This would provide users with more options for machine learning algorithms and enhance the library's capabilities.

Describe alternatives you've considered
An alternative would be to use CatBoost separately and then use mleko for other parts of the machine learning pipeline. However, having everything integrated into one library would be more convenient.

Additional context
CatBoost is a powerful machine learning algorithm that is widely used in the industry. Its inclusion would greatly enhance the capabilities of the mleko library.

[Feature] Add Simple ExpressionFilter for removing rows

Is your feature request related to a problem? Please describe.
There's currently no straightforward way to remove rows based on certain conditions in the mleko library.

Describe the solution you'd like
I propose adding a simple ExpressionFilter feature that allows users to specify conditions to filter out rows. This could be based on values, indexes, or any other user-defined criteria.

Describe alternatives you've considered
An alternative could be to manually remove the rows after processing, but this can be cumbersome and inefficient.

Additional context
This feature would enhance the data preprocessing capabilities of the mleko library, making it more versatile and user-friendly. As a first step, the focus should be on simplicity and ease of use.

[Bug] Check cache file for corruption on cache hit

Describe the bug
If mleko crashes while writing a cache file, the file might become corrupted. This can lead to confusing, hard-to-diagnose errors, so we should perform a sanity check when reading from the cache.

To Reproduce

  1. Crash during writing of Vaex DF on purpose.
  2. Try reading Vaex DF from disk and see that it fails to display even though the cached function executed correctly.

Expected behavior
The caching behavior should stop or crash the program on purpose if a corrupted cache file is detected.
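One way to approach this, sketched under assumptions: write the cache file atomically and store a checksum next to it, then verify the checksum on every cache hit. The helper names write_cache and read_cache and the ".sha256" sidecar file are hypothetical, and raw bytes stand in for a serialized Vaex DataFrame.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def _digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def write_cache(path: Path, data: bytes) -> None:
    """Write to a temporary file, then rename it into place.

    A crash during the write leaves only the temporary file behind,
    never a partially written file at `path`.
    """
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "wb") as tmp_file:
        tmp_file.write(data)
    os.replace(tmp_name, path)
    # Sidecar checksum used for the sanity check on later cache hits.
    path.with_name(path.name + ".sha256").write_text(_digest(data))


def read_cache(path: Path) -> bytes:
    """Return the cached bytes, raising if they do not match the checksum."""
    checksum_path = path.with_name(path.name + ".sha256")
    if not checksum_path.exists():
        raise RuntimeError(f"Cache file {path} has no checksum; refusing to use it")
    data = path.read_bytes()
    if _digest(data) != checksum_path.read_text():
        raise RuntimeError(f"Cache file {path} is corrupted")
    return data


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "df.cache"
    write_cache(cache_file, b"payload")
    intact = read_cache(cache_file)
    cache_file.write_bytes(b"payl")  # simulate a truncated write
    try:
        read_cache(cache_file)
        corruption_detected = False
    except RuntimeError:
        corruption_detected = True
```

Hashing large cache files on every hit has a cost; a cheaper check such as comparing the recorded file size would catch truncation but not other damage.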

[Feature] Improve General Documentation

Is your feature request related to a problem? Please describe.
While the automatically generated documentation from docstrings is useful for understanding the technical details of the mleko library, it is not always sufficient for providing a comprehensive understanding of the library's usage, especially for newcomers. I find it difficult to understand how to use the library in a larger context, such as how different components interact with each other, best practices for using the library, and real-world examples.

Describe the solution you'd like
I propose that we supplement the existing docstring-based documentation with more free text-based documentation. This could include tutorials, how-to guides, explanations of the library's design philosophy, and more detailed descriptions of the library's components. In addition, we should provide more real-world examples of how to use the library to solve common problems in model building.

Describe alternatives you've considered
An alternative solution would be to enhance the docstrings with more detailed examples and usage scenarios. However, this might make the docstrings overly verbose and difficult to navigate. Furthermore, it would not provide a suitable place for the more high-level and conceptual documentation that I propose.

Additional context
Good documentation can significantly lower the barrier to entry for new users and can also serve as a valuable reference for existing users. By providing more comprehensive documentation, we can make the MLEKO library more accessible and easier to use.

[Feature] Add OneHotEncoderTransformer

Is your feature request related to a problem? Please describe.
Currently, there is no built-in functionality to perform one-hot encoding on categorical data in the mleko library.

Describe the solution you'd like
I would like to add a OneHotEncoderTransformer to the library, which would allow users to easily perform one-hot encoding on their categorical data.
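For reference, the transformation itself is small. The sketch below uses plain Python lists rather than a Vaex DataFrame, and the function name and the "column_category" naming scheme are assumptions.

```python
def one_hot_encode(values: list[str], column: str) -> dict[str, list[int]]:
    """Return one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return {
        f"{column}_{category}": [int(value == category) for value in values]
        for category in categories
    }


encoded = one_hot_encode(["red", "blue", "red"], "color")
```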

Describe alternatives you've considered
An alternative could be to use pandas' get_dummies function or sklearn's OneHotEncoder, but having a built-in transformer in the mleko library would make the process more streamlined and consistent.

Additional context
One-hot encoding is a common preprocessing step in machine learning pipelines, so this feature would be a valuable addition to the mleko library.

[Feature] Add Support for Reading Manifest Files from S3 Buckets

Is your feature request related to a problem? Please describe.
Currently, there is no support for reading manifest files directly from S3 buckets and translating them to the local manifest format in the mleko library.

Describe the solution you'd like
I would like to add a feature that allows for reading manifest files from S3 buckets and translating them to the local manifest format.

Describe alternatives you've considered
An alternative could be to manually download the manifest files from the S3 bucket and then translate them to the local format, but this would be less efficient and more error-prone.

Additional context
This feature would greatly enhance the functionality of the mleko library and make it more useful for users who work with AWS S3.

[Bug] Ingesters Attempting Deletion of Other Directories

Describe the bug
When fetching data using an ingester and specifying a directory with subdirectories (potentially caches from other functions), the ingesters attempt to delete those as well.

To Reproduce

import os

# S3Ingester comes from mleko; its import line is not shown in the report.

s3_ingester = S3Ingester(
    destination_directory="data",
    s3_bucket_name="mleko-datasets",
    s3_key_prefix="kaggle/ashishpatel26/indian-food-101",
    aws_profile_name="mleko",
    aws_region_name="eu-west-1",
    num_workers=64,
    manifest_file_name="manifest",
    check_s3_timestamps=True,
)
# First fetch downloads the files into "./data"
s3_ingester.fetch_data()

# Create directory inside "./data"
os.mkdir("./data/some_other_cache")

# Will crash due to trying to unlink a directory
s3_ingester.fetch_data(force_recompute=True)

Expected behavior
The ingesters should only delete the files related to the ingester's operation, not other subdirectories or files within the specified directory.

Environment (please complete the following information):

  • OS: *
  • Python version: *
  • MLEKO version: 1.1.0

Additional context
A potential solution could be creating a local manifest file after each successful download, specifying file names and sizes. Then, only those files would be deleted when we "clear" the ingester cache. Also, documentation on the destination directory should be updated, as currently it's set as "data".
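The local manifest idea could be sketched as follows. The manifest file name and both helper functions are hypothetical, and the example runs against a temporary directory rather than a real ingester.

```python
import json
import tempfile
from pathlib import Path

MANIFEST_NAME = ".ingester_manifest.json"  # hypothetical file name


def record_downloads(directory: Path, file_names: list[str]) -> None:
    """Store the names and sizes of the files the ingester downloaded."""
    entries = {name: (directory / name).stat().st_size for name in file_names}
    (directory / MANIFEST_NAME).write_text(json.dumps(entries))


def clear_ingester_files(directory: Path) -> None:
    """Delete only the files listed in the local manifest."""
    manifest_path = directory / MANIFEST_NAME
    if not manifest_path.exists():
        return
    for name in json.loads(manifest_path.read_text()):
        target = directory / name
        if target.is_file():
            target.unlink()
    manifest_path.unlink()


with tempfile.TemporaryDirectory() as tmp_dir:
    data_dir = Path(tmp_dir)
    (data_dir / "part-0.csv").write_text("a,b\n1,2\n")
    record_downloads(data_dir, ["part-0.csv"])
    (data_dir / "some_other_cache").mkdir()  # unrelated subdirectory
    clear_ingester_files(data_dir)
    remaining = sorted(p.name for p in data_dir.iterdir())
```

Because deletion is driven by the manifest rather than by listing the directory, the unrelated subdirectory is never touched.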

[Feature] Example Notebook

Is your feature request related to a problem? Please describe.
While mleko is a comprehensive and powerful library, new users may find it challenging to understand how to use it effectively on real-world datasets. Currently, there is a lack of detailed, practical examples that show how to use mleko's features in a realistic context.

Describe the solution you'd like
I propose that we create an example Jupyter notebook that showcases how to use mleko on a well-known dataset. This notebook should include:

  • Comprehensive descriptions of each step, explaining not just how to use mleko's features, but why we are using them and what effect they have on the data.
  • Well-documented code that adheres to Python's best practices for readability and maintainability.
  • A structure suitable for a live demo, with clear delineations between different sections and steps of the data science process.

Describe alternatives you've considered
An alternative could be to enhance the existing documentation with more examples. However, a standalone example notebook has several advantages. It would provide a more realistic and practical context, and it could be used as a basis for live demos or tutorials.

Additional context
This notebook would serve as a valuable resource for both new and existing users. New users can follow along with the notebook to understand how to use mleko, while existing users can use it as a reference for how to apply mleko's features to their own datasets.

[Feature] Add optuna-dashboard support

Is your feature request related to a problem? Please describe.
When running hyperparameter tuning using Optuna it is often beneficial to visualize the search space.

Describe the solution you'd like
Add the parameters the OptunaTuner needs so that users can configure a destination file for the experiments.

[Feature] Add support for custom evaluation functions in LGBMModel

Is your feature request related to a problem? Please describe.
Currently, there is no way to add custom evaluation functions to the LGBMModel in the mleko library.

Describe the solution you'd like
I would like to have a feature that allows users to add their own custom evaluation functions to the LGBMModel. This would make the library more flexible and adaptable to various use-cases.
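As a point of reference, LightGBM's scikit-learn interface accepts a callable eval_metric of the form func(y_true, y_pred) that returns a tuple (eval_name, eval_result, is_higher_better). The sketch below shows such a callable only; how LGBMModel would accept and forward it is an open design question, and the metric itself is a hypothetical example that assumes y_pred holds positive-class probabilities.

```python
def precision_at_threshold(y_true, y_pred, threshold: float = 0.5):
    """Custom metric: precision of positive predictions at a fixed threshold.

    Returns (eval_name, eval_result, is_higher_better).
    """
    predicted_positive = [p >= threshold for p in y_pred]
    true_positive = sum(
        1 for label, flag in zip(y_true, predicted_positive) if flag and label == 1
    )
    n_predicted = sum(predicted_positive)
    precision = true_positive / n_predicted if n_predicted else 0.0
    return f"precision_at_{threshold}", precision, True


metric_name, metric_value, higher_is_better = precision_at_threshold(
    [1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]
)
```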

Describe alternatives you've considered
An alternative could be to allow users to extend the LGBMModel class and override the evaluation method, but this might be more complex and less user-friendly.

Additional context
This feature would be particularly useful for users who need to use custom metrics for model evaluation that are not currently supported by the library.

[Feature] Basic Hyperparameter Tuning

Describe the solution you'd like
When building models it is common practice to run hyperparameter tuning to find a suitable set of hyperparameters for the final model fitting.

Describe alternatives you've considered
Preferably we start small and implement some simple hyperparameter tuning algorithms such as GridSearch or RandomSearch before moving on to Bayesian approaches.
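The two simple strategies could be sketched as below. The function names, the toy objective, and the parameter names in the search space are assumptions for illustration; in practice the objective would fit a model and return a validation score.

```python
import itertools
import random
from typing import Any, Callable

Params = dict[str, Any]


def grid_search(
    space: dict[str, list[Any]], objective: Callable[[Params], float]
) -> tuple[Params, float]:
    """Evaluate every combination; return (best_params, best_score), lower is better."""
    names = list(space)
    candidates = (
        dict(zip(names, combo)) for combo in itertools.product(*space.values())
    )
    return min(((p, objective(p)) for p in candidates), key=lambda pair: pair[1])


def random_search(
    space: dict[str, list[Any]],
    objective: Callable[[Params], float],
    n_trials: int,
    seed: int = 0,
) -> tuple[Params, float]:
    """Evaluate `n_trials` randomly sampled combinations; lower is better."""
    rng = random.Random(seed)
    trials = [
        {name: rng.choice(options) for name, options in space.items()}
        for _ in range(n_trials)
    ]
    return min(((p, objective(p)) for p in trials), key=lambda pair: pair[1])


def toy_objective(params: Params) -> float:
    # Stand-in for "fit the model and return validation loss".
    return (params["learning_rate"] - 0.1) ** 2 + (params["num_leaves"] - 31) ** 2


space = {"learning_rate": [0.01, 0.1, 0.3], "num_leaves": [15, 31, 63]}
best_params, best_score = grid_search(space, toy_objective)
random_params, random_score = random_search(space, toy_objective, n_trials=5)
```

Grid search cost grows with the product of the option counts, which is the usual reason to fall back to random search once the space gets larger.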

Additional context
The model and pipeline are already set up to accept hyperparameters during fitting, so no major overhaul is needed.

[Feature] Handling Imbalanced Datasets

Is your feature request related to a problem? Please describe.
Handling imbalanced datasets is a common problem in Machine Learning. Currently, the mleko library does not provide any functionality to handle this issue.

Describe the solution you'd like
I would like mleko to include a feature that addresses label imbalance. Integrating a tool like ImbLearn could be a good starting point.

Describe alternatives you've considered
An alternative would be to use ImbLearn or other similar tools separately. However, integrating it into mleko would make the workflow more streamlined and efficient.

Additional context
Handling imbalanced datasets is crucial in many real-world applications, therefore, adding this feature would increase the practical usability of the mleko library.

[Feature] Model Calibration, Logistic Regression

Is your feature request related to a problem? Please describe.
Currently, the mleko library does not support model calibration. This is a problem when trying to improve the reliability of probability estimates in models such as Logistic Regression.

Describe the solution you'd like
I would like a feature to be added that supports model calibration. Initially, adding support for Logistic Regression would be sufficient. Ideally, this would be implemented via Platt Scaling.
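Platt Scaling fits a sigmoid p = 1 / (1 + exp(-(a * s + b))) to a model's raw scores s on held-out data. A minimal sketch follows, assuming plain gradient descent on the log loss instead of the Newton-style solver usually used, with hypothetical function names and made-up toy data.

```python
import math


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr: float = 0.1, n_iter: int = 2000):
    """Fit slope `a` and intercept `b` by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            error = _sigmoid(a * s + b) - y
            grad_a += error * s
            grad_b += error
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(scores, a: float, b: float):
    """Map raw scores to calibrated probabilities."""
    return [_sigmoid(a * s + b) for s in scores]


# Toy held-out scores and labels (illustrative only).
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]
a, b = fit_platt(scores, labels)
probabilities = calibrate(scores, a, b)
```

The calibration parameters should be fitted on data the model was not trained on; otherwise the calibrated probabilities inherit the model's overconfidence.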

Describe alternatives you've considered
An alternative solution would be to manually implement Platt Scaling outside of the mleko library. However, this is less efficient and would require extra work for each project.

Additional context
Model calibration can improve the reliability of probability estimates, which is crucial in many Machine Learning applications. Therefore, adding this feature would greatly enhance the utility of the mleko library.

[Feature] Creation of a Comprehensive USAGE.md File

Is your feature request related to a problem? Please describe.
Currently, there is no USAGE.md file in the mleko library. This makes it difficult for new users to understand how to use the library effectively.

Describe the solution you'd like
A USAGE.md file should be created that provides comprehensive instructions on how to use the mleko library. This should include examples of common use cases, explanations of key functions and parameters, and any other information that would be helpful to users.

Describe alternatives you've considered
An alternative could be to include usage instructions in the README.md file. However, having a separate USAGE.md file would be more organized and easier for users to navigate.

Additional context
Having clear and comprehensive usage instructions is crucial for any library. This would greatly enhance the usability of the mleko library and help users get the most out of its features.

[Feature] Add Base Egestor and S3Egestor

Is your feature request related to a problem? Please describe.
There is currently no way to unload data in a specific format to an S3 bucket in the mleko library.

Describe the solution you'd like
I would like to add a base class Egestor and an S3Egestor that can unload data in a specific format to an S3 bucket. It should also create S3-compatible manifest files.

Describe alternatives you've considered
An alternative could be to manually unload the data and then upload it to the S3 bucket, but this would be less efficient and more error-prone.

Additional context
This feature would greatly enhance the functionality of the mleko library and make it more useful for users who work with AWS S3.

[Bug] Booleans are stored as strings instead of ints after using CSVToVaexConverter

Describe the bug
Booleans are stored as strings instead of ints after using the CSVToVaexConverter.

To Reproduce
Steps to reproduce the behavior:

  1. Load a CSV file with boolean values using CSVToVaexConverter
  2. Check the data types of the loaded dataframe
  3. Notice that boolean values are stored as strings

Expected behavior
Booleans should be stored as integers (0 or 1) after using the CSVToVaexConverter.
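The expected conversion could be sketched as follows. The accepted spellings and the choice to map empty strings to None are assumptions, and the function name is hypothetical.

```python
from typing import Optional

# Assumed spellings of boolean values in CSV files.
_TRUE = {"true", "t", "1", "yes"}
_FALSE = {"false", "f", "0", "no"}


def booleans_to_ints(values: list[str]) -> list[Optional[int]]:
    """Map textual booleans to 1/0, keep empty strings as None, reject other text."""
    converted: list[Optional[int]] = []
    for value in values:
        text = value.strip().lower()
        if text in _TRUE:
            converted.append(1)
        elif text in _FALSE:
            converted.append(0)
        elif text == "":
            converted.append(None)
        else:
            raise ValueError(f"Not a boolean value: {value!r}")
    return converted


as_ints = booleans_to_ints(["True", "false", "", " TRUE "])
```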

Environment (please complete the following information):

  • OS: *
  • Python version: *
  • MLEKO version: 1.1.0

[Feature] LabelEncoderTransformer transforms Null to its own category

Is your feature request related to a problem? Please describe.
The LabelEncoderTransformer currently transforms Null values into their own category. It's unclear whether this should be the case or not.

Describe the solution you'd like
Consider adding an additional parameter to control this behavior. If the parameter is set to True, Null values should be transformed into their own category. If False, Null values should be preserved as is.

Describe alternatives you've considered
An alternative could be to leave the current behavior as the default but provide clear documentation on how Null values are handled.

[Feature] Add use_cache Argument to Cached Functions

Is your feature request related to a problem? Please describe.
Currently, there is no way to bypass the cache in cached functions in the mleko library. This can be an issue during tasks like hyperparameter tuning.

Describe the solution you'd like
I would like to add a boolean argument, use_cache, to all cached functions. When set to False, this argument would bypass the cache. The default value would be True.
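The proposed argument could be sketched with a decorator like the one below. This is an assumption-heavy illustration: an in-memory dict stands in for mleko's cache, the decorator name cached is hypothetical, and bypassing is interpreted as skipping both the cache read and the cache write.

```python
import functools


def cached(func):
    """Cache results in memory; pass use_cache=False to bypass the cache."""
    store = {}

    @functools.wraps(func)
    def wrapper(*args, use_cache: bool = True, **kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        if use_cache and key in store:
            return store[key]
        result = func(*args, **kwargs)
        if use_cache:
            store[key] = result
        return result

    return wrapper


calls = []  # records every real execution of the wrapped function


@cached
def expensive_square(x):
    calls.append(x)
    return x * x


first = expensive_square(4)                    # computed and cached
second = expensive_square(4)                   # served from the cache
third = expensive_square(4, use_cache=False)   # recomputed, cache untouched
```

Keeping use_cache as a keyword-only argument with a default of True means existing call sites keep their current behavior.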

Describe alternatives you've considered
An alternative could be to manually call the uncached variants prefixed with an underscore, but it would not be very intuitive.

Additional context
This feature would make the mleko library more flexible and user-friendly, especially for users who are performing tasks like hyperparameter tuning.
