floscha / tabular-dataset

License: MIT License

Python 100.00%

tabular-dataset's Introduction

Welcome to my GitHub profile!

tabular-dataset's People

Forkers

harupy

tabular-dataset's Issues

Add quantile normalization

Add a new quantile method to the numeric.normalize transformation, based on scikit-learn's QuantileTransformer.
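A minimal sketch of what such a method might do internally, assuming the columns live in a pandas DataFrame. The helper name `quantile_normalize` and its parameters are hypothetical and not part of the library; only `QuantileTransformer` itself comes from scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import QuantileTransformer


def quantile_normalize(df, columns, n_quantiles=100):
    """Hypothetical helper: map the given columns to a uniform [0, 1] range."""
    # n_quantiles must not exceed the number of rows.
    n_quantiles = min(n_quantiles, len(df))
    transformer = QuantileTransformer(
        n_quantiles=n_quantiles,
        output_distribution="uniform",
        random_state=0,
    )
    result = df.copy()
    result[columns] = transformer.fit_transform(df[columns])
    # Return the fitted transformer so it can be reused on test data.
    return result, transformer


df = pd.DataFrame({"income": [10.0, 20.0, 30.0, 1000.0, 50.0]})
normalized, fitted = quantile_normalize(df, ["income"])
```

Because the mapping is rank-based, the outlier (1000.0) ends up at the top of the range without stretching the other values.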

As different kinds of ML models require different preprocessing steps (e.g. NNs usually expect numerical values to be normalized unlike tree-based models), there should be a number of templates that apply all basic transformations while following the best practices for different model types.

Add type hints

Add type hints to all modules and add mypy to the CI pipeline.

Add resampling option

Add an option to up-/downsample imbalanced datasets.
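One possible shape for the upsampling half, sketched with plain pandas. The function name `upsample_minority` and its arguments are assumptions made for illustration; the issue does not specify an API.

```python
import pandas as pd


def upsample_minority(df, target, random_state=0):
    """Hypothetical helper: resample every class up to the majority class size."""
    counts = df[target].value_counts()
    majority_size = counts.max()
    parts = []
    for label, count in counts.items():
        group = df[df[target] == label]
        if count < majority_size:
            # Draw the missing rows with replacement from the smaller class.
            extra = group.sample(
                n=majority_size - count, replace=True, random_state=random_state
            )
            group = pd.concat([group, extra])
        parts.append(group)
    return pd.concat(parts).reset_index(drop=True)


df = pd.DataFrame({"x": range(6), "y": [0, 0, 0, 0, 1, 1]})
balanced = upsample_minority(df, "y")
```

Downsampling would follow the same pattern, sampling each larger class down to the minority size without replacement.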

Automatically infer column types

Add an option to automatically infer column types from a pandas DataFrame based on pandas' data types.
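A rough sketch of how dtype-based inference could work. The function `infer_column_types` and the four group names are assumptions for illustration; the dtype checks come from `pandas.api.types`.

```python
import pandas as pd
from pandas.api import types as ptypes


def infer_column_types(df):
    """Hypothetical helper: group column names by their pandas dtype."""
    inferred = {"numerical": [], "binary": [], "categorical": [], "datetime": []}
    for name in df.columns:
        column = df[name]
        # Check bool first, since pandas also treats bool as a numeric dtype.
        if ptypes.is_bool_dtype(column):
            inferred["binary"].append(name)
        elif ptypes.is_datetime64_any_dtype(column):
            inferred["datetime"].append(name)
        elif ptypes.is_numeric_dtype(column):
            inferred["numerical"].append(name)
        else:
            inferred["categorical"].append(name)
    return inferred


df = pd.DataFrame(
    {
        "age": [31, 45],
        "is_member": [True, False],
        "city": ["Berlin", "Paris"],
        "signup": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    }
)
column_types = infer_column_types(df)
```

Dtypes alone cannot tell an integer ID from a true numeric feature, so a user-facing option would probably still need manual overrides.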

Handle previously unseen labels

Right now, a ValueError is raised if a label appears at test time that was not seen during training.
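One way to avoid the error is to reserve a dedicated code for unknown labels. The class below is a hypothetical, dependency-free sketch and not the library's encoder; the name `SafeLabelEncoder` and the `unknown_value` parameter are assumptions.

```python
class SafeLabelEncoder:
    """Hypothetical encoder that maps unseen labels to a reserved code."""

    def __init__(self, unknown_value=-1):
        self.unknown_value = unknown_value
        self.mapping_ = {}

    def fit(self, values):
        # Assign consecutive integer codes to the sorted training labels.
        self.mapping_ = {label: i for i, label in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # Labels missing from the mapping fall back to unknown_value.
        return [self.mapping_.get(value, self.unknown_value) for value in values]


encoder = SafeLabelEncoder().fit(["a", "b", "c"])
encoded = encoder.transform(["b", "d"])
```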

Add Featuretools integration

Add a static .from_featuretools(feature_matrix, features_defs) method to TabularDataset which creates a new TabularDataset object based on the data created by Featuretools.

Add impute method for datetime columns

Add the ability to impute datetime values with either a specified fixed value or the mode.
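A small sketch of both strategies on a pandas Series. The function name `impute_datetime` and the `method`/`fill_value` parameters are hypothetical.

```python
import pandas as pd


def impute_datetime(series, method="mode", fill_value=None):
    """Hypothetical helper: fill missing datetimes with the mode or a fixed value."""
    if method == "mode":
        modes = series.mode(dropna=True)
        if modes.empty:
            # Nothing to learn from an all-missing column.
            return series
        fill = modes.iloc[0]
    elif method == "fixed":
        fill = pd.Timestamp(fill_value)
    else:
        raise ValueError(f"Unknown method: {method}")
    return series.fillna(fill)


dates = pd.Series(
    pd.to_datetime(["2020-01-01", "2020-01-01", None, "2020-03-01"])
)
by_mode = impute_datetime(dates)
by_fixed = impute_datetime(dates, method="fixed", fill_value="2019-12-31")
```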

Add generated columns back to TabularDataset

E.g. impute transformation should not only add a column to the pandas DataFrame but also create a new binary column.

Add z normalization

Add usage section to readme

Add a Usage section to the README.md file, so that (potential) users can get a quick overview of the library's functionality.

Support date and time variables

Add train/test data support

Currently, all encoders are fit on all data. Instead, it should be possible to fit the encoders on the training data only and then transform both train and test data with those same encoders.
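The fit-on-train, apply-to-both pattern can be sketched as follows, using z normalization as a stand-in for any fitted transformation. The functions `fit_scaler` and `apply_scaler` are hypothetical names, not part of the library.

```python
import pandas as pd


def fit_scaler(train, columns):
    """Learn mean and standard deviation from the training data only."""
    return {"mean": train[columns].mean(), "std": train[columns].std(ddof=0)}


def apply_scaler(df, columns, params):
    """Apply previously learned statistics to any DataFrame."""
    result = df.copy()
    result[columns] = (df[columns] - params["mean"]) / params["std"]
    return result


train = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"x": [2.0, 4.0]})

params = fit_scaler(train, ["x"])
train_scaled = apply_scaler(train, ["x"], params)
test_scaled = apply_scaler(test, ["x"], params)
```

The test rows never influence the statistics, which avoids leaking information from test data into the fitted transformation.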

Add cuDF support

Add repr method for TabularDataset object

Add overview to readme

Add an overview section to the readme that explains the main purpose and ideas behind the library.

Add imputed column

Add an argument to all implementations of the impute() method that allows adding a binary column which is 1 where a value has been imputed and 0 otherwise.
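A sketch of the idea for a single column. The function `impute_with_indicator`, the `add_indicator` argument, and the `_imputed` suffix are all assumptions made for illustration.

```python
import pandas as pd


def impute_with_indicator(df, column, fill_value, add_indicator=True):
    """Hypothetical helper: fill missing values and optionally flag imputed rows."""
    result = df.copy()
    if add_indicator:
        # Record which rows were missing before the values are filled in.
        result[f"{column}_imputed"] = df[column].isna().astype(int)
    result[column] = df[column].fillna(fill_value)
    return result


df = pd.DataFrame({"age": [20.0, None, 40.0]})
imputed = impute_with_indicator(df, "age", fill_value=0.0)
```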

Add date_diff transformation

In addition to #30, add a date_diff() transformation that takes as an argument a fixed date or another date column to calculate the difference from.
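A minimal sketch of such a transformation, returning the difference in whole days. The function signature is an assumption; here a `reference` string that matches a column name is treated as a column, and anything else is parsed as a fixed date.

```python
import pandas as pd


def date_diff(df, column, reference):
    """Hypothetical helper: days between a date column and a column or fixed date."""
    if isinstance(reference, str) and reference in df.columns:
        ref = df[reference]
    else:
        ref = pd.Timestamp(reference)
    return (df[column] - ref).dt.days


df = pd.DataFrame(
    {
        "start": pd.to_datetime(["2020-01-01", "2020-01-10"]),
        "end": pd.to_datetime(["2020-01-05", "2020-01-20"]),
    }
)
diff_to_column = date_diff(df, "end", "start")
diff_to_fixed = date_diff(df, "end", "2020-01-01")
```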

Split fits on both train and test part

The split method introduced with #48 fits all transformations on both the train and test part of the fold. Instead, only the train part should be used for fitting.

Add aggregate transformations

Add a way to aggregate features using, for example:

Mean
Median
Mode
Min
Max
Unique
Skew
Kurtosis
Kstat
Percentile
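Several of the aggregates listed above can be sketched with a pandas groupby. The helper `aggregate_features`, the grouping setup, and the output column names are assumptions; skew, kurtosis and k-statistics are left out to keep the example short.

```python
import pandas as pd


def aggregate_features(df, group_column, value_column):
    """Hypothetical helper: per-group summary statistics of one column."""
    grouped = df.groupby(group_column)[value_column]
    aggregated = grouped.agg(["mean", "median", "min", "max", "nunique"])
    # The mode can have ties, so take the smallest of the most frequent values.
    aggregated["mode"] = grouped.agg(lambda values: values.mode().iloc[0])
    # The 90th percentile stands in for a configurable percentile.
    aggregated["p90"] = grouped.quantile(0.9)
    return aggregated.add_prefix(f"{value_column}_")


df = pd.DataFrame(
    {
        "customer": ["a", "a", "a", "b", "b"],
        "amount": [1, 1, 4, 10, 20],
    }
)
features = aggregate_features(df, "customer", "amount")
```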

Allow calling fitted data multiple times

Right now, if e.g. the x property of a TabularDataset is called once, the transformers are fitted to the train data. But when called twice, an exception is thrown stating that the transformers have already been fitted. Instead, the fitted transformers should simply be applied without an error.
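The desired behaviour amounts to fitting lazily on first access and reusing the fitted state afterwards. The class below is a hypothetical, simplified stand-in (mean centering instead of real transformers) and does not reflect the actual TabularDataset internals.

```python
class LazyScaledData:
    """Hypothetical stand-in: fits once on first access, then reuses the fit."""

    def __init__(self, train_values):
        self._train_values = list(train_values)
        self._mean = None
        self.fit_count = 0

    def _fit_if_needed(self):
        # Fit only when no fitted state exists yet, instead of raising.
        if self._mean is None:
            self._mean = sum(self._train_values) / len(self._train_values)
            self.fit_count += 1

    @property
    def x(self):
        self._fit_if_needed()
        return [value - self._mean for value in self._train_values]


data = LazyScaledData([1.0, 2.0, 3.0])
first = data.x
second = data.x
```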

Add linter

Add pylama for linting and include it in the CI pipeline as well.

Add lineage checks

Given the existing lineage property, it should be possible to create meaningful error messages for cases where transformations are applied in the wrong order.

Add serialization

Add a way to serialize TabularDataset objects without storing all their data.

Add count encodings

Add a method to encode variables by adding counts of their individual values.
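A sketch of count encoding with pandas, where counts are learned on training data and values unseen in training get a count of 0. The function name `count_encode` and the handling of unseen values are assumptions.

```python
import pandas as pd


def count_encode(train, test, column):
    """Hypothetical helper: replace each value by its frequency in the training data."""
    counts = train[column].value_counts()
    train_encoded = train[column].map(counts)
    # Values that never occurred in training have no count, so use 0.
    test_encoded = test[column].map(counts).fillna(0).astype(int)
    return train_encoded, test_encoded


train = pd.DataFrame({"city": ["a", "a", "b"]})
test = pd.DataFrame({"city": ["a", "c"]})
train_counts, test_counts = count_encode(train, test, "city")
```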

Change impute values from list to dict?

Think about moving the impute_values variable, used for tabular_dataset.transformations.categorical.impute, to a dict data structure, similar to e.g. the various encoders.

Add binning transformation

Add a transformation method that allows binning continuous values into discrete buckets.
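Binning with fixed edges can be sketched using `pandas.cut`. The wrapper `bin_column` and the example edges and labels are hypothetical.

```python
import pandas as pd


def bin_column(series, edges, labels):
    """Hypothetical helper: assign each value to a labeled, right-closed interval."""
    return pd.cut(series, bins=edges, labels=labels)


ages = pd.Series([5, 17, 30, 64, 80])
# Intervals are (0, 18], (18, 65] and (65, 120].
age_groups = bin_column(ages, [0, 18, 65, 120], ["child", "adult", "senior"])
```

For buckets with roughly equal numbers of rows rather than fixed edges, `pandas.qcut` would be the natural alternative.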

Add train/test split

Add a train_test_split method that behaves the same as the scikit-learn equivalent.
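For reference, this is how the scikit-learn function behaves on a pandas DataFrame. How a TabularDataset method would wrap it is not specified in the issue, so the example only shows the underlying call.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(8), "y": [0, 1] * 4})

# Hold out 25% of the rows; random_state makes the shuffle reproducible.
train, test = train_test_split(df, test_size=0.25, random_state=0)
```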

Add unit tests for columns argument

Right now the unit tests don't really cover the columns argument for transformations.

Add len() to AllColumns class

Add power transformation

Add dimensionality reduction

Add support for dimensionality reduction techniques such as PCA and t-SNE. They will transform multiple numerical columns into a new output column.
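A sketch of the PCA case, reducing several numerical columns to one new column with scikit-learn. The helper `add_pca_column` and the output column name `pca_0` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.decomposition import PCA


def add_pca_column(df, columns, output_column="pca_0"):
    """Hypothetical helper: project the given columns onto their first component."""
    reducer = PCA(n_components=1)
    result = df.copy()
    result[output_column] = reducer.fit_transform(df[columns])[:, 0]
    # Return the fitted reducer so the same projection can be applied to test data.
    return result, reducer


df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.1, 5.9, 8.0]})
reduced, reducer = add_pca_column(df, ["a", "b"])
```

t-SNE differs in that scikit-learn's implementation has no separate `transform` for new data, so it would not fit the same fit-on-train, apply-to-test flow.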

Add drop_first argument for categorical encodings

Add a drop_first argument like pandas' get_dummies() method.
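For reference, this is the pandas behaviour the argument would mirror: with `drop_first=True`, the first category level is dropped so that k categories yield k-1 indicator columns.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue"]})

# Levels are sorted, so "blue" is the first level and gets dropped.
encoded = pd.get_dummies(df, columns=["color"], drop_first=True)
```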

Add abbreviations to access columns

To make code that accesses transformations quicker to write, the following abbreviations could be added:

num for numerical
bin for binary
cat for categorical

Since fewer transformations should usually be applied to target, it might not need an abbreviation.