tabular-dataset's People

Contributors

floscha

Forkers

harupy

tabular-dataset's Issues

Add transformation templates

Different kinds of ML models require different preprocessing steps; for example, neural networks usually expect numerical values to be normalized, whereas tree-based models do not. There should therefore be a number of templates that apply all basic transformations while following best practices for each model type.

Add type hints

Add type hints to all modules and add mypy to the CI pipeline.

Add Featuretools integration

Add a static .from_featuretools(feature_matrix, features_defs) method to TabularDataset which creates a new TabularDataset object based on the data created by Featuretools.

Add usage section to readme

Add a Usage section to the README.md file so that (potential) users can get a quick overview of the library's functionality.

Add train/test data support

Currently, all encoders are fit on all data. Instead, it should be possible to fit the encoders on the training data only and then transform both train and test data with those same encoders.
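
The idea can be sketched as follows. This is not the library's API: OrdinalEncoder is a hypothetical class written for illustration, and mapping categories that were never seen during fitting to -1 is an assumption.

```python
class OrdinalEncoder:
    """Hypothetical encoder used only to illustrate fit-on-train, transform-both."""

    def __init__(self):
        self.mapping = None

    def fit(self, values):
        # Learn the category -> integer mapping from the training data only.
        self.mapping = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # Categories never seen during fit are mapped to -1 (an assumption).
        return [self.mapping.get(v, -1) for v in values]


train = ["red", "green", "red", "blue"]
test = ["green", "yellow"]

encoder = OrdinalEncoder().fit(train)      # fitted on train only
train_encoded = encoder.transform(train)   # same mapping ...
test_encoded = encoder.transform(test)     # ... applied to both splits
```

Because the test split never influences the mapping, no information leaks from test data into the encoder.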

Add overview to readme

Add an overview section to the readme that explains the main purpose and ideas behind the library.

Add imputed column

Add an argument to all implementations of the impute() method that allows adding a binary indicator column, set to 1 when a row has been imputed and 0 otherwise.
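
A minimal sketch of what such an indicator could look like on a single column. The function name impute_with_indicator and the use of None for missing values are assumptions for illustration, not the signature of the library's impute() method.

```python
def impute_with_indicator(values, fill_value):
    """Replace None with fill_value and report which rows were imputed."""
    imputed = [fill_value if v is None else v for v in values]
    # 1 where the original value was missing, 0 otherwise.
    was_imputed = [1 if v is None else 0 for v in values]
    return imputed, was_imputed


ages = [31, None, 47, None]
ages_imputed, ages_was_imputed = impute_with_indicator(ages, fill_value=0)
```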

Add date_diff transformation

In addition to #30, add a date_diff() transformation that takes as an argument a fixed date or another date column to calculate the difference from.
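
One possible shape for this transformation, sketched with the standard library only. The function below is hypothetical; returning whole days and computing "column minus reference" are assumptions about the desired convention.

```python
from datetime import date


def date_diff(dates, reference):
    """Return day differences between `dates` and a fixed date or a second column."""
    if isinstance(reference, date):
        # A single fixed date is compared against every row.
        reference = [reference] * len(dates)
    if len(reference) != len(dates):
        raise ValueError("reference column must have the same length as dates")
    return [(d - r).days for d, r in zip(dates, reference)]


signup = [date(2020, 1, 1), date(2020, 3, 1)]
first_order = [date(2020, 1, 11), date(2020, 3, 2)]

days_since_2020 = date_diff(signup, date(2020, 1, 1))  # fixed date
days_to_order = date_diff(first_order, signup)         # another date column
```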

Allow calling fitted data multiple times

Currently, when e.g. the x property of a TabularDataset is accessed for the first time, the transformers are fitted to the train data. On a second access, however, an exception is thrown stating that the transformers have already been fitted. Instead, the already-fitted transformers should simply be applied without an error.
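
The desired behavior can be illustrated with a small stand-in class. LazyScaler is hypothetical and far simpler than TabularDataset; it only shows the "fit on first access, reuse afterwards" pattern.

```python
class LazyScaler:
    """Hypothetical stand-in for a dataset that fits its transformer on first access."""

    def __init__(self, train_values):
        self._train_values = train_values
        self._max = None      # fitted state; None means "not fitted yet"
        self.fit_count = 0    # only here to show that fitting happens once

    @property
    def x(self):
        if self._max is None:
            # First access: fit on the training data.
            self._max = max(self._train_values)
            self.fit_count += 1
        # Every access: apply the already-fitted transformer, no error raised.
        return [v / self._max for v in self._train_values]


ds = LazyScaler([1.0, 2.0, 4.0])
first = ds.x
second = ds.x
```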

Add linter

Add pylama for linting and include it in the CI pipeline as well.

Add lineage checks

Given the existing lineage property, it should be possible to create meaningful error messages for cases where transformations are applied in the wrong order.
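
A rough sketch of such a check, assuming the lineage can be read as a list of step names. The REQUIRES table, the step names, and check_order are all hypothetical; the actual structure of the lineage property may differ.

```python
# Hypothetical rule table: each transformation lists the steps that must come first.
REQUIRES = {"encode": ["impute"], "normalize": ["impute"]}


def check_order(lineage, next_step):
    """Raise a descriptive error if a prerequisite of next_step is missing."""
    missing = [s for s in REQUIRES.get(next_step, []) if s not in lineage]
    if missing:
        raise ValueError(
            f"Cannot apply '{next_step}' yet: apply {', '.join(missing)} first "
            f"(lineage so far: {lineage or 'empty'})."
        )


check_order(["impute"], "encode")  # passes: impute has already been applied

try:
    check_order([], "normalize")
except ValueError as error:
    message = str(error)
```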

Add serialization

Add a way to serialize TabularDataset objects without storing all their data.
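
One way this could work, sketched under the assumption that the fitted state is JSON-serializable. DatasetSketch, to_json, and from_json are hypothetical names, not part of the library.

```python
import json


class DatasetSketch:
    """Hypothetical container holding data plus fitted transformer state."""

    def __init__(self, data, fitted_state):
        self.data = data
        self.fitted_state = fitted_state

    def to_json(self):
        # Only the fitted state is written; the data itself is left out.
        return json.dumps({"fitted_state": self.fitted_state})

    @classmethod
    def from_json(cls, payload, data=None):
        # Data can be attached again later, e.g. a new batch to transform.
        return cls(data=data, fitted_state=json.loads(payload)["fitted_state"])


original = DatasetSketch(data=list(range(1000)), fitted_state={"mean": 499.5})
payload = original.to_json()
restored = DatasetSketch.from_json(payload)
```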

Add count encodings

Add a method to encode variables by adding counts of their individual values.
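
Count encoding replaces each value with the number of times it occurs in the column. A minimal sketch using only the standard library (the function name is an assumption):

```python
from collections import Counter


def count_encode(values):
    """Replace every value with its frequency in the column."""
    counts = Counter(values)
    return [counts[v] for v in values]


cities = ["Berlin", "Paris", "Berlin", "Rome", "Berlin"]
city_counts = count_encode(cities)
```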

Change impute values from list to dict?

Think about moving the impute_values variable, used for tabular_dataset.transformations.categorical.impute, to a dict data structure, similar to e.g. the various encoders.

Add dimensionality reduction

Add support for dimensionality reduction techniques such as PCA and t-SNE. They will transform multiple numerical columns into a new output column.

Add abbreviations to access columns

To make code that accesses transformations quicker to write, the following abbreviations could be added:

  • num for numerical
  • bin for binary
  • cat for categorical

Since fewer transformations should usually be applied to target, it might not need an abbreviation.
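
The abbreviations could be implemented as read-only property aliases. The class below is a hypothetical sketch; it assumes the full-length accessors are attributes named numerical, binary, and categorical.

```python
class AliasedDataset:
    """Hypothetical container; the three accessors are placeholders."""

    def __init__(self, numerical, binary, categorical):
        self.numerical = numerical
        self.binary = binary
        self.categorical = categorical

    # Read-only aliases that return the very same accessor objects.
    num = property(lambda self: self.numerical)
    bin = property(lambda self: self.binary)
    cat = property(lambda self: self.categorical)


ds = AliasedDataset(numerical=["age"], binary=["is_member"], categorical=["city"])
```

Since each alias returns the original object rather than a copy, ds.num and ds.numerical stay in sync.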

Add k-fold cross-validation support

In addition to #5, it should be easy to get k folds for cross-validation, ideally as an iterator so that not all folds are created at once (which could cause memory issues).
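
A generator is one way to produce folds lazily. The sketch below yields index lists one fold at a time; it is a hypothetical helper that does not shuffle and assumes rows are addressed by position.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) one fold at a time."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size


folds = k_fold_indices(n_rows=5, k=2)
first_train, first_test = next(folds)  # later folds are not computed yet
```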

Add power normalization

Add a new power method to the numeric.normalize transformation, based on scikit-learn's PowerTransformer.

Also, rename the current power transformation to exp to avoid ambiguity.
