floscha / tabular-dataset
License: MIT License
Add a new quantile method to the numeric.normalize transformation, based on scikit-learn's QuantileTransformer.
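A minimal sketch of what such a quantile method might do, assuming it simply wraps scikit-learn's QuantileTransformer. The function name quantile_normalize and its parameters are hypothetical and not part of the library's API.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer


def quantile_normalize(values, n_quantiles=100, random_state=0):
    """Map values onto a uniform [0, 1] distribution (hypothetical helper)."""
    arr = np.asarray(values, dtype=float).reshape(-1, 1)
    # scikit-learn warns if n_quantiles exceeds the number of samples.
    n_quantiles = min(n_quantiles, arr.shape[0])
    transformer = QuantileTransformer(
        n_quantiles=n_quantiles,
        output_distribution="uniform",
        random_state=random_state,
    )
    return transformer.fit_transform(arr).ravel()


# Example: a skewed sample is spread over the unit interval.
skewed = np.random.default_rng(0).exponential(size=200)
normalized = quantile_normalize(skewed)
```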
As different kinds of ML models require different preprocessing steps (e.g. NNs usually expect numerical values to be normalized unlike tree-based models), there should be a number of templates that apply all basic transformations while following the best practices for different model types.
Add type hints to all modules and add mypy to the CI pipeline.
Add an option to up-/downsample imbalanced datasets.
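As a rough illustration of the downsampling half of this idea, the sketch below reduces every class to the size of the smallest one using plain pandas. The function name downsample and the label_column argument are assumptions made for the example.

```python
import pandas as pd


def downsample(df, label_column, seed=0):
    """Randomly shrink each class to the size of the rarest class (hypothetical helper)."""
    smallest = df[label_column].value_counts().min()
    return df.groupby(label_column).sample(n=smallest, random_state=seed)


# Example: 8 rows of class 0 and 2 rows of class 1 become 2 rows each.
imbalanced = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})
balanced = downsample(imbalanced, "label")
```

Upsampling could follow the same pattern by sampling with replacement up to the size of the largest class.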
Add an option to automatically infer column types from a pandas DataFrame based on pandas' data types.
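One possible way to do this inference is sketched below with pandas' dtype helpers. The function name infer_column_types and the four group names are assumptions for illustration; the library may group columns differently.

```python
import pandas as pd
from pandas.api import types as ptypes


def infer_column_types(df):
    """Group column names by a coarse type derived from their pandas dtype."""
    inferred = {"numerical": [], "binary": [], "categorical": [], "datetime": []}
    for name in df.columns:
        column = df[name]
        # Check bool first: pandas also treats bool as a numeric dtype.
        if ptypes.is_bool_dtype(column):
            inferred["binary"].append(name)
        elif ptypes.is_datetime64_any_dtype(column):
            inferred["datetime"].append(name)
        elif ptypes.is_numeric_dtype(column):
            inferred["numerical"].append(name)
        else:
            inferred["categorical"].append(name)
    return inferred


example = pd.DataFrame(
    {
        "age": [31, 45],
        "member": [True, False],
        "city": ["Berlin", "Paris"],
        "joined": pd.to_datetime(["2020-01-01", "2021-06-15"]),
    }
)
column_types = infer_column_types(example)
```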
Right now, a ValueError is thrown if a label is used during test time that hasn't been seen during training time.
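The issue does not say how unseen labels should be handled. One common option, shown here purely as an assumption, is to map them to a reserved code instead of raising. The helper names and the -1 code are hypothetical.

```python
import pandas as pd


def fit_label_mapping(train_values):
    """Assign an integer code to each label seen during training."""
    return {label: code for code, label in enumerate(pd.Series(train_values).unique())}


def encode_labels(values, mapping, unseen_code=-1):
    """Encode labels, sending anything not in the mapping to unseen_code."""
    encoded = pd.Series(values).map(mapping)  # unseen labels become NaN
    return encoded.fillna(unseen_code).astype(int)


mapping = fit_label_mapping(["cat", "dog", "cat"])
test_codes = encode_labels(["dog", "bird"], mapping)  # "bird" was never seen
```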
Add a static .from_featuretools(feature_matrix, features_defs) method to TabularDataset which creates a new TabularDataset object based on the data created by Featuretools.
Add a possibility to impute datetime values with either a specified fixed value or the mode.
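A small sketch of both imputation modes using pandas only. The function name impute_datetime and the convention that fill_value=None selects the mode are assumptions for this example.

```python
import pandas as pd


def impute_datetime(series, fill_value=None):
    """Fill missing datetimes with a fixed value, or with the mode if none is given."""
    series = pd.to_datetime(series)
    if fill_value is None:
        # mode() ignores missing values; take the first entry if there are ties.
        fill_value = series.mode().iloc[0]
    return series.fillna(pd.Timestamp(fill_value))


dates = pd.Series(["2021-01-01", None, "2021-01-01", "2021-03-05"])
filled_with_mode = impute_datetime(dates)
filled_with_fixed = impute_datetime(dates, fill_value="2000-01-01")
```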
E.g. impute transformation should not only add a column to the pandas DataFrame but also create a new binary column.
Add a Usage section to the README.md file, so that (potential) users can get a quick overview of the library's functionality.
Currently, all encoders are fit on all data. Instead, it should be possible to fit the encoders on the training data only and then transform both train and test data with those same encoders.
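The intended behaviour can be illustrated with scikit-learn directly. This is only a sketch of the fit-on-train, transform-both pattern, not the library's own encoder code; the toy data is made up.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["red"], ["blue"], ["red"]], dtype=object)
test = np.array([["blue"], ["green"]], dtype=object)

# Fit on the training data only, then reuse the same encoder for both parts.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train).toarray()
test_encoded = encoder.transform(test).toarray()
```

With handle_unknown="ignore", a category that only appears in the test data ("green" here) is encoded as an all-zero row rather than raising an error.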
Add an overview section to the readme that explains the main purpose and ideas behind the library.
Add an additional argument to all implementations of the impute() method that allows adding a binary column containing 1 when a row has been imputed and 0 otherwise.
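A minimal pandas sketch of the requested behaviour. The function name impute_with_indicator, the add_indicator argument, and the "_imputed" column suffix are all hypothetical choices for this example.

```python
import pandas as pd


def impute_with_indicator(df, column, fill_value, add_indicator=True):
    """Fill missing values and optionally record which rows were imputed."""
    result = df.copy()
    missing = result[column].isna()  # remember the gaps before filling them
    result[column] = result[column].fillna(fill_value)
    if add_indicator:
        result[f"{column}_imputed"] = missing.astype(int)
    return result


people = pd.DataFrame({"age": [20.0, None, 40.0]})
imputed = impute_with_indicator(people, "age", fill_value=30.0)
```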
In addition to #30, add a date_diff() transformation that takes as an argument a fixed date or another date column to calculate the difference from.
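The following sketch shows how such a transformation could accept either kind of reference. Returning whole days and the sign convention (dates minus reference) are assumptions; the issue does not specify either.

```python
import pandas as pd


def date_diff(dates, reference):
    """Difference in whole days between a date column and a fixed date or second column."""
    dates = pd.to_datetime(dates)
    # pd.to_datetime accepts both a single date and a whole Series.
    reference = pd.to_datetime(reference)
    return (dates - reference).dt.days


events = pd.DataFrame(
    {
        "start": ["2020-01-01", "2020-01-31"],
        "signup": ["2020-01-10", "2020-02-01"],
    }
)
days_since_fixed = date_diff(events["signup"], "2020-01-01")
days_since_start = date_diff(events["signup"], events["start"])
```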
The split method introduced with #48 fits all transformations on both the train and test part of the fold. Instead, only the train part should be used for fitting.
Add a way to aggregate features using for example
Right now, if e.g. the x property of a TabularDataset is called once, the transformers are fitted to the train data. But when called twice, an exception is thrown stating that the transformers have already been fitted. Instead, the fitted transformers should simply be applied without an error.
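The desired fit-once, apply-many behaviour can be sketched with a toy class. LazyCenteredData is a hypothetical stand-in and does not mirror TabularDataset's internals; its only "transformer" is mean centering.

```python
class LazyCenteredData:
    """Toy dataset whose only transformation subtracts the fitted mean."""

    def __init__(self, values):
        self._values = list(values)
        self._mean = None  # None means "not fitted yet"
        self.fit_count = 0  # only here to show that fitting happens once

    def _fit(self):
        self._mean = sum(self._values) / len(self._values)
        self.fit_count += 1

    @property
    def x(self):
        # Fit on first access; later accesses reuse the fitted state.
        if self._mean is None:
            self._fit()
        return [value - self._mean for value in self._values]


data = LazyCenteredData([1.0, 2.0, 3.0])
first = data.x
second = data.x  # no exception, and no second fit
```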
Add pylama for linting and include it in the CI pipeline as well.
Given the existing lineage property, it should be possible to create meaningful error messages for cases where transformations are applied in the wrong order.
Add a way to serialize TabularDataframe objects without storing all their data.
Add a method to encode variables by adding counts of their individual values.
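Count encoding is short enough to sketch with pandas alone. The function name count_encode is an assumption for this example.

```python
import pandas as pd


def count_encode(series):
    """Replace each value with the number of times it occurs in the series."""
    counts = series.value_counts()
    return series.map(counts)


letters = pd.Series(["a", "b", "a", "c", "a"])
letter_counts = count_encode(letters)
```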
Think about moving the impute_values variable, used for tabular_dataset.transformations.categorical.impute, to a dict data structure, similar to e.g. the various encoders.
Add a transformation method that allows binning continuous values into discrete buckets.
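Binning of this kind is what pandas' cut function provides, so a sketch could wrap it as below. The function name bin_values and the age buckets are made up for illustration.

```python
import pandas as pd


def bin_values(series, bins, labels=None):
    """Assign each continuous value to a right-closed bucket."""
    return pd.cut(series, bins=bins, labels=labels)


ages = pd.Series([5, 17, 30, 70])
age_groups = bin_values(
    ages,
    bins=[0, 18, 65, 120],
    labels=["child", "adult", "senior"],
)
```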
Add a train_test_split method that behaves the same as the scikit-learn equivalent.
Right now the unit tests don't really cover the columns argument for transformations.
Add support for dimensionality reduction techniques such as PCA and t-SNE. They will transform multiple numerical columns into a new output column.
Add a drop_first argument like pandas' get_dummies() method.
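For reference, this is how drop_first behaves in pandas itself; the example data is made up.

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "red"], name="color")

# Without drop_first, every category gets its own indicator column.
full = pd.get_dummies(colors, prefix="color")

# With drop_first=True, the first category (in sorted order) is dropped,
# which avoids perfectly collinear indicator columns.
reduced = pd.get_dummies(colors, prefix="color", drop_first=True)
```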
To write code that accesses transformations more quickly, the following abbreviations could be added:
Since fewer transformations should usually be applied to target, it might not need an abbreviation.
In addition to #5, it should be easily possible to get k folds for cross-validation, ideally as an iterator to not create all folds at once (thus potentially causing memory issues).
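A generator is one way to get this lazy behaviour. The sketch below yields index arrays for one fold at a time; the function name iter_folds and its arguments are hypothetical.

```python
import numpy as np


def iter_folds(n_rows, k, shuffle=True, seed=0):
    """Yield (train_indices, test_indices) pairs one fold at a time."""
    indices = np.arange(n_rows)
    if shuffle:
        np.random.default_rng(seed).shuffle(indices)
    for test_indices in np.array_split(indices, k):
        # Everything that is not in the test part belongs to the train part.
        train_indices = np.setdiff1d(indices, test_indices)
        yield train_indices, test_indices


# Folds are only materialized as the loop advances.
fold_sizes = [(len(train), len(test)) for train, test in iter_folds(10, k=3)]
```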
Allow exporting TabularDataset objects as a list of TensorFlow feature columns.
Use rankdata.
Add a transformation to NumericalColumns that works like scikit-learn's normalize and rename the existing normalize method to scale.
Add a new power method to the numeric.normalize transformation, based on scikit-learn's PowerTransformer.
Also, rename the current power transformation to exp to avoid ambiguity.
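A minimal sketch of what the new power method might do, assuming it wraps scikit-learn's PowerTransformer. The function name power_normalize and the choice of Yeo-Johnson as the default are assumptions, not the library's API.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer


def power_normalize(values, method="yeo-johnson"):
    """Apply a power transform so the result looks more Gaussian (hypothetical helper)."""
    arr = np.asarray(values, dtype=float).reshape(-1, 1)
    # standardize=True rescales the output to zero mean and unit variance.
    transformer = PowerTransformer(method=method, standardize=True)
    return transformer.fit_transform(arr).ravel()


skewed_sample = np.random.default_rng(0).exponential(size=200)
gaussian_like = power_normalize(skewed_sample)
```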