
Feature-engine



Feature-engine is a Python library with multiple transformers to engineer and select features for use in machine learning models. Feature-engine's transformers follow Scikit-learn's API, with fit() and transform() methods to learn parameters from the data and then transform it.
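
For example, a minimal sketch of the fit/transform pattern (the toy data is illustrative):

import pandas as pd
from feature_engine.imputation import MeanMedianImputer

df = pd.DataFrame({"age": [20, 30, None, 40]})

# fit() learns the median of "age"; transform() fills the missing value with it
imputer = MeanMedianImputer(imputation_method="median", variables=["age"])
imputer.fit(df)
df_t = imputer.transform(df)  # the NaN becomes 30.0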

Feature-engine features in the following resources

Blogs about Feature-engine

Documentation

Psst! How did you find us?

We want to share Feature-engine with more people. It would help us a lot if you told us how you discovered us.

Then we'd know what we are doing right and which channels to use to share the love.

Please share your story by answering 1 quick question at this link. 😃

Feature-engine's transformers currently include functionality for:

  • Missing Data Imputation
  • Categorical Encoding
  • Discretisation
  • Outlier Capping or Removal
  • Variable Transformation
  • Variable Creation
  • Variable Selection
  • Datetime Features
  • Time Series
  • Preprocessing
  • Scikit-learn Wrappers

Imputation Methods

  • MeanMedianImputer
  • ArbitraryNumberImputer
  • RandomSampleImputer
  • EndTailImputer
  • CategoricalImputer
  • AddMissingIndicator
  • DropMissingData
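
A minimal sketch combining two of these imputers on toy data (the column names are illustrative):

import pandas as pd
from feature_engine.imputation import AddMissingIndicator, CategoricalImputer

df = pd.DataFrame({"city": ["London", None, "Paris"], "size": [1.0, None, 3.0]})

# flag where "size" was missing, then fill missing categories with "Missing"
df = AddMissingIndicator(variables=["size"]).fit_transform(df)
df = CategoricalImputer(variables=["city"]).fit_transform(df)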

Encoding Methods

  • OneHotEncoder
  • OrdinalEncoder
  • CountFrequencyEncoder
  • MeanEncoder
  • WoEEncoder
  • RareLabelEncoder
  • DecisionTreeEncoder
  • StringSimilarityEncoder
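
A minimal sketch with one of these encoders (toy data; the API follows the same fit/transform pattern shown above):

import pandas as pd
from feature_engine.encoding import CountFrequencyEncoder

df = pd.DataFrame({"colour": ["blue", "blue", "green", "red"]})

# replace each category with its count in the training data
encoder = CountFrequencyEncoder(encoding_method="count", variables=["colour"])
df_t = encoder.fit_transform(df)  # blue -> 2, green -> 1, red -> 1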

Discretisation methods

  • EqualFrequencyDiscretiser
  • EqualWidthDiscretiser
  • GeometricWidthDiscretiser
  • DecisionTreeDiscretiser
  • ArbitraryDiscretiser
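
A minimal sketch with the equal-frequency discretiser (the toy data is illustrative):

import numpy as np
import pandas as pd
from feature_engine.discretisation import EqualFrequencyDiscretiser

df = pd.DataFrame({"income": np.random.exponential(30000, size=100)})

# sort values into 4 quantile-based bins, returned as the integers 0-3
disc = EqualFrequencyDiscretiser(q=4, variables=["income"])
df_t = disc.fit_transform(df)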

Outlier Handling methods

  • Winsorizer
  • ArbitraryOutlierCapper
  • OutlierTrimmer
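
A minimal sketch with the Winsorizer (the toy data is illustrative):

import numpy as np
import pandas as pd
from feature_engine.outliers import Winsorizer

df = pd.DataFrame({"fare": np.append(np.random.normal(30, 5, 99), 500.0)})

# cap values above mean + 3 standard deviations, learned during fit()
capper = Winsorizer(capping_method="gaussian", tail="right", fold=3, variables=["fare"])
df_t = capper.fit_transform(df)  # the 500 is pulled down to the learned boundary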

Variable Transformation methods

  • LogTransformer
  • LogCpTransformer
  • ReciprocalTransformer
  • ArcsinTransformer
  • PowerTransformer
  • BoxCoxTransformer
  • YeoJohnsonTransformer
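
A minimal sketch with the log transformer (toy data is illustrative; LogTransformer raises an error if any value is <= 0):

import numpy as np
import pandas as pd
from feature_engine.transformation import LogTransformer

df = pd.DataFrame({"income": np.random.lognormal(mean=10, sigma=1, size=100)})

# apply the natural logarithm to the selected variables
df_t = LogTransformer(variables=["income"]).fit_transform(df)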

Variable Creation:

  • MathFeatures
  • RelativeFeatures
  • CyclicalFeatures
  • DecisionTreeFeatures
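
A minimal sketch with CyclicalFeatures (the toy data is illustrative):

import pandas as pd
from feature_engine.creation import CyclicalFeatures

df = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})

# add hour_sin and hour_cos, so that hour 23 lands close to hour 0
df_t = CyclicalFeatures(variables=["hour"]).fit_transform(df)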

Feature Selection:

  • DropFeatures
  • DropConstantFeatures
  • DropDuplicateFeatures
  • DropCorrelatedFeatures
  • SmartCorrelatedSelection
  • SelectByShuffling
  • SelectBySingleFeaturePerformance
  • SelectByTargetMeanPerformance
  • RecursiveFeatureElimination
  • RecursiveFeatureAddition
  • DropHighPSIFeatures
  • SelectByInformationValue
  • ProbeFeatureSelection
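
A minimal sketch chaining two of the simpler selectors (the toy data is illustrative):

import pandas as pd
from feature_engine.selection import DropConstantFeatures, DropDuplicateFeatures

df = pd.DataFrame({
    "const": [1, 1, 1, 1],       # constant, so dropped
    "x": [1, 2, 3, 4],
    "x_copy": [1, 2, 3, 4],      # duplicate of x, so dropped
})

df_t = DropConstantFeatures().fit_transform(df)
df_t = DropDuplicateFeatures().fit_transform(df_t)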

Datetime

  • DatetimeFeatures
  • DatetimeSubtraction
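
A minimal sketch with DatetimeFeatures (the toy data is illustrative):

import pandas as pd
from feature_engine.datetime import DatetimeFeatures

df = pd.DataFrame({"signup": pd.to_datetime(["2023-01-15", "2023-06-01"])})

# extract a chosen subset of features; the original column is dropped by default
dtf = DatetimeFeatures(variables=["signup"], features_to_extract=["month", "day_of_week"])
df_t = dtf.fit_transform(df)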

Time Series

  • LagFeatures
  • WindowFeatures
  • ExpandingWindowFeatures
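
A minimal sketch with LagFeatures (the toy data is illustrative):

import pandas as pd
from feature_engine.timeseries.forecasting import LagFeatures

df = pd.DataFrame(
    {"sales": [10, 12, 11, 13]},
    index=pd.date_range("2023-01-01", periods=4, freq="D"),
)

# add sales_lag_1 and sales_lag_2, shifted by 1 and 2 periods
df_t = LagFeatures(variables=["sales"], periods=[1, 2]).fit_transform(df)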

Pipelines

  • Pipeline
  • make_pipeline
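
A minimal sketch, assuming the Pipeline and make_pipeline exposed under feature_engine.pipeline in recent releases (they mirror scikit-learn's, but also support transformers that add or drop rows):

from feature_engine.encoding import OrdinalEncoder
from feature_engine.imputation import MeanMedianImputer
from feature_engine.pipeline import make_pipeline

pipe = make_pipeline(
    MeanMedianImputer(imputation_method="median"),
    OrdinalEncoder(encoding_method="arbitrary"),
)
# pipe.fit(X_train) then pipe.transform(X_test), as with any scikit-learn pipeline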

Preprocessing

  • MatchCategories
  • MatchVariables

Wrappers:

  • SklearnTransformerWrapper
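
A minimal sketch applying a scikit-learn transformer to selected columns only, while keeping the dataframe format (the toy data is illustrative):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0], "c": ["x", "y", "z"]})

# scale only columns "a" and "b"; "c" passes through untouched
scaler = SklearnTransformerWrapper(transformer=StandardScaler(), variables=["a", "b"])
df_t = scaler.fit_transform(df)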

Installation

From PyPI using pip:

pip install feature_engine

From Anaconda:

conda install -c conda-forge feature_engine

Or simply clone it:

git clone https://github.com/feature-engine/feature_engine.git

Example Usage

>>> import pandas as pd
>>> from feature_engine.encoding import RareLabelEncoder

>>> data = {'var_A': ['A'] * 10 + ['B'] * 10 + ['C'] * 2 + ['D'] * 1}
>>> data = pd.DataFrame(data)
>>> data['var_A'].value_counts()
A    10
B    10
C     2
D     1
Name: var_A, dtype: int64
>>> rare_encoder = RareLabelEncoder(tol=0.10, n_categories=3)
>>> data_encoded = rare_encoder.fit_transform(data)
>>> data_encoded['var_A'].value_counts()
A       10
B       10
Rare     3
Name: var_A, dtype: int64

Find more examples in our Jupyter Notebook Gallery or in the documentation.

Contribute

Details about how to contribute can be found in the Contribute page.

Briefly:

  • Fork the repo
  • Clone your fork into your local computer:
git clone https://github.com/<YOURUSERNAME>/feature_engine.git
  • Navigate into the repo folder:
cd feature_engine
  • Install Feature-engine as a developer:
pip install -e .
  • Optional: Create and activate a virtual environment with any tool of choice
  • Install Feature-engine dependencies:
pip install -r requirements.txt

and

pip install -r test_requirements.txt
  • Create a feature branch with a meaningful name for your feature:
git checkout -b myfeaturebranch
  • Develop your feature, tests and documentation
  • Make sure the tests pass
  • Make a PR

Thank you!!

Documentation

Feature-engine documentation is built using Sphinx and is hosted on Read the Docs.

To build the documentation, make sure you have the dependencies installed. From the root directory, run:

pip install -r docs/requirements.txt

Now you can build the docs using:

sphinx-build -b html docs build

License

The content of this repository is licensed under a BSD 3-Clause license.

Sponsor us

Sponsor us and support our mission to democratize machine learning and programming tools through open-source software.


feature_engine's Issues

SklearnTransformerWrapper error when not defining variables

Describe the bug
SklearnTransformerWrapper raises an error when variables is not defined (I'd like to transform all the features). When any variables are defined, it works.

To Reproduce
I demonstrate it with StandardScaler, as in the sketch below.
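
A minimal reconstruction of the reported scenario (the exact code from the original screenshot is not available):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper

X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# reportedly raised an error in feature-engine 0.5.15 when variables was omitted
scaler = SklearnTransformerWrapper(transformer=StandardScaler())
X_t = scaler.fit_transform(X)

# whereas passing the columns explicitly worked
scaler = SklearnTransformerWrapper(transformer=StandardScaler(), variables=["a", "b"])
X_t = scaler.fit_transform(X)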

Desktop (please complete the following information):

  • OS: Windows 8.1
  • Browser: Chrome
  • Version: feature-engine 0.5.15, scikit-learn 0.23.1, pandas 1.0.5

Kindly help. Thanks.

improve discretisers jupyter demos

Separate each discretiser into a single jupyter notebook.

Check if new functionality has been added and add an example in the notebook.

add contribute page

Add contribute page to welcome contributors and explain how the contribution process works.

Discretiser - return interval boundaries

I would like to suggest that the discretisers have an option to return the interval boundaries instead of an integer; it makes the variable's output more understandable.

For example: the outputs 1, 3, 2 become (0,10], (20,30], (10,20]

Thanks in advance

Return 0 for unseen labels of a Categorical/Nominal Variable in Test Set or in Production

The two questions below can be answered with this feature request.
Question 1: In case a categorical variable in the train set does not have a 'Rare' label, based on the values of the parameters 'tol' and 'n_categories', how do we handle unseen values in the test set or in production?

Question 2: In the absence of a 'Rare' label in a categorical variable in the train set, based on the values of the parameters 'tol' and 'n_categories', is there any mechanism in Feature-engine like handle_unknown='ignore', as used by sklearn's OneHotEncoder, for any of the categorical encoding methods implemented in Feature-engine?

CategoricalVariableImputer

Allow the user to select the string with which to replace missing data. At the moment, the only value allowed is the default: 'missing'.

Creation of new feature instead of overwriting the feature

First, thanks for the package; it simplifies the overall process. My issue: when using the count frequency encoder, why am I not able to create a new feature instead of overwriting the categorical variable? It would be useful if I wanted to use another categorical encoder function afterwards.
Correct me if I am wrong and this feature already exists in the package.

Code smell

Massive PR:
Separate modules with > 500 lines of code into relevant modules, all as part of a folder.

Example: a folder for the missing data imputers, containing multiple scripts, each holding transformers with similar functionality, or one transformer at a time.

Draft of utility functions

Hi, as stated in the course, here is a draft of the functions used several times in the course (I tried them and they seem to work fine, but I haven't tested them completely):

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# for Q-Q plots
import scipy.stats as stats



def diagnostic_boxplots(df, variable, figsize=None):
    '''Plot a histogram, Q-Q plot and boxplot for the variable of interest
    in the dataframe (df). figsize is an optional tuple, default (16, 4).'''

    # define figure size
    if figsize is None:
        figsize = (16, 4)
    plt.figure(figsize=figsize)

    # histogram
    plt.subplot(1, 3, 1)
    sns.histplot(df[variable], bins=30, kde=True)  # distplot is deprecated in recent seaborn
    plt.title('Histogram')

    # Q-Q plot
    plt.subplot(1, 3, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.ylabel('Variable quantiles')

    # boxplot
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df[variable])
    plt.title('Boxplot')

    plt.show()



# function to find upper and lower boundaries
# for normally distributed variables


def find_normal_boundaries(df, variable, distance=3):

    ''' calculate the boundaries outside which sit the outliers
     for a Gaussian distribution given the distance, default 3 '''

    upper_boundary = df[variable].mean() + distance * df[variable].std()
    lower_boundary = df[variable].mean() - distance * df[variable].std()

    return upper_boundary, lower_boundary


# function to find upper and lower boundaries
# for skewed distributed variables


def find_skewed_boundaries(df, variable, distance):

    ''' Let's calculate the boundaries outside which sit the outliers
     for skewed distributions

     distance passed as an argument, gives us the option to
     estimate 1.5 times or 3 times the IQR to calculate
     the boundaries.'''

    IQR = df[variable].quantile(0.75) - df[variable].quantile(0.25)

    lower_boundary = df[variable].quantile(0.25) - (IQR * distance)
    upper_boundary = df[variable].quantile(0.75) + (IQR * distance)

    return upper_boundary, lower_boundary


def diagnostic_plots(df, variable, figsize=None):
    '''Plot a histogram and a Q-Q plot side by side for a given variable,
    optionally with figsize as a tuple, default (15, 6).'''

    if figsize is None:
        figsize = (15, 6)
    plt.figure(figsize=figsize)
    plt.subplot(1, 2, 1)
    df[variable].hist(bins=30)

    plt.subplot(1, 2, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)

    plt.show()


def set_boundaries(df, variable, upper_limit=None, lower_limit=None):
    '''Set the boundaries, one or both, for the specified variable in the
    DataFrame (df).'''

    if upper_limit is None and lower_limit is None:
        return df[variable]
    elif lower_limit is None:
        return np.where(df[variable] > upper_limit, upper_limit, df[variable])
    elif upper_limit is None:
        return np.where(df[variable] < lower_limit, lower_limit, df[variable])
    else:
        return np.where(df[variable] > upper_limit, upper_limit,
                        np.where(df[variable] < lower_limit, lower_limit, df[variable]))


def flag_boundaries(df, variable, upper_limit=None, lower_limit=None):
    '''Flag values beyond the boundaries, one or both, for the specified
    variable in the DataFrame (df).'''

    if upper_limit is None and lower_limit is None:
        return False
    elif lower_limit is None:
        return np.where(df[variable] > upper_limit, True, False)
    elif upper_limit is None:
        return np.where(df[variable] < lower_limit, True, False)
    else:
        return np.where(df[variable] > upper_limit, True,
                        np.where(df[variable] < lower_limit, True, False))




def find_quantile_boundaries(df, variable, lower_quantile=0.05, upper_quantile=0.95):

    '''Return the boundaries at the given quantiles, default 0.05 and 0.95.'''

    lower_boundary = df[variable].quantile(lower_quantile)
    upper_boundary = df[variable].quantile(upper_quantile)

    return upper_boundary, lower_boundary

Nothing new but they look convenient to me.

Create new features from all combinations of categorical features

Is your feature request related to a problem? Please describe.
If we have categorical features, how do we create new features from all of their combinatoric combinations? In real life categorical features are NOT independent; many of them depend on each other.

Even scikit-learn cannot do this, but maybe you will?

related to
PacktPublishing/Python-Feature-Engineering-Cookbook#1
Describe the solution you'd like
For example, a maximum number of combined features is given: 2, 4 or 5.

For a pandas DataFrame you can use concatenation:
https://stackoverflow.com/questions/19377969/combine-two-columns-of-text-in-dataframe-in-pandas-python

columns = ['whatever', 'columns', 'you', 'choose']
df['period'] = df[columns].astype(str).sum(axis=1)

For three-feature combinations from 11 features, you get 165 new features from all combinations (not permutations). Three nested loops do not seem a good way to do this (see the sketch after this section):

for i in range(1, 11):
    for j in range(i + 1, 11):
        for k in range(j + 1, 11):

Then you get many new features.

"Another alternative that I've seen from some Kaggle masters is to join the categories of 2 different variables into a new categorical variable. So, for example, if you have the variable gender, with the values female and male for observations 1 and 2, and the variable colour, with the values blue and green for observations 1 and 2 respectively, you could create a third categorical variable called gender-colour, with the value female-blue for observation 1 and male-green for observation 2. Then you would apply the encoding methods from section 3 to this new variable."

Yes, do this, but it should not necessarily require pandas. Also think about RAM use, since there will be a lot of new features: before creating them, consider converting the categorical features to "int" types with a small number of digits from numpy.
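
A hedged sketch of the requested behavior, using itertools.combinations instead of nested loops (the function name and defaults are illustrative, not part of feature_engine):

from itertools import combinations

import pandas as pd

def combine_categoricals(df, columns, max_order=3, sep="-"):
    """Concatenate every combination of 2 to max_order of the given columns."""
    df = df.copy()
    for r in range(2, max_order + 1):
        for cols in combinations(columns, r):
            df[sep.join(cols)] = df[list(cols)].astype(str).agg(sep.join, axis=1)
    return df

df = pd.DataFrame({"gender": ["female", "male"], "colour": ["blue", "green"]})
df = combine_categoricals(df, ["gender", "colour"], max_order=2)
# df["gender-colour"] -> ["female-blue", "male-green"]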

improve outlier jupyter demos

Demonstrate each class in a single jupyter notebook (at the moment all classes are together in one notebook).

Check if new functionality has been added, and include a demonstration in the notebook.

transform the feature back to original form

When we use target-guided mean encoding, we replace the labels with their mean and fit the algorithm on top of it. So if we then create a decision tree, it shows the splits of the features using the mean values.
It would be better if there were an inverse transform function in Feature-engine which could restore the original values of the dataframe.

Requesting a new feature addition in the Feature-engine library to include the inverse transform.

improve imputation notebooks

Separate each missing data imputer into a single notebook, within the imputer folder. Check if new parameters or changes have been done to the imputers, and try to include a demo in the notebook.

Allow outlier removers to operate on data with missing values

Is your feature request related to a problem? Please describe.
A typical way of imputing numerical data is by replacing missing values with the mean. However, if the data has outliers, this approach can be ineffective. But the outlier remover classes in Feature-engine cannot be fit on data that has missing values, meaning that imputation must happen first. It would be great if these classes were compatible with missing data, so that outliers could be capped before imputation occurs.

Describe the solution you'd like
Allow the outlier remover classes (e.g. Winsorizer) to fit and transform data with missing values. The missing values can be ignored in both cases.

Allow user defined value to replace rare categories

Great package! Thanks for all the work put into it.

Just a small point: in the RareLabelCategoricalEncoder, I wondered if it might be nice to allow the user to define the value with which to replace rare categories (but still default to 'Rare'). You could imagine someone wanting all the rare categories subsumed into an existing larger category, for example.

add / update contributing docs

The contributing page does not reflect the way in which we are actually working. So we need to re-write it. At the moment we are working with feature branches for each issue and pushing directly to master.

We need to make clear that for code changes, we need to bump the version at each PR, so the new version is automatically uploaded to PyPI.

We need to define if we need changes in the version for changes in the docs. I don't think we do, but need to read more about best practices.

code review

Have the existing code reviewed by an experienced Python developer.

add test to determine if WoERatioCategoricalEncoder returns an error when the probability in the denominator is 0

At the moment the test is commented out:
https://github.com/solegalli/feature_engine/blob/master/tests/test_encoding/test_woe_encoder.py#L112

the aim is to test this bit of code here

I think the test should work; I commented it out because I changed the error for a warning. But now we have decided to go back to returning an error.

So, in short, the aim is to corroborate that the commented-out test checks the intended bit of code; if yes, uncomment it, otherwise replace it with a suitable test.

include transformer for datetime variables

The first version of this module should include the following:

  • New module: datetime (folder)

Three transformers:

  • ExtractDateFeatures
  • ExtractTimeFeatures
  • ExtractDateTimeFeatures

A base class:

  • DateTimeBaseTransformer

The base class should:

  • check that the variables entered by the user are datetime, or if string / object, transform them to datetime (I wonder if we should make this a function in variable_manipulation and call it in this transformer? If we develop the timeseries module, we may need it for other modules as well)

  • have a method to return time features

  • have a method to return date features

  • methods to check the input dataframe, and options on what to do if outliers are present

The ExtractDateFeatures should derive the following features:

  • month, quarter, semester and year (all numeric outputs)
  • week of the year
  • is week of the month supported by pandas? if yes, then we should return it
  • day (numeric 1-31), day of the week (numeric 0-6), and is_weekend (binary)
  • anything else supported by pandas?
  • anything else that would be useful

The ExtractTimeFeatures should extract the following features

  • hour, minute, second
  • timezone: with the parameter return_timezone=False by default, we allow the user to return a time zone categorical feature
  • unify_timezone=False, to handle different timezones. If True, the transformer is timezone aware, unifies to Greenwich and then derives the features (should we give the user the option to unify to something else? probably yes)
  • to discuss: is_morning, is_afternoon, is_evening
  • to discuss: working hours (I am thinking of passing a string parameter like '9-17' and using it to determine this feature)
  • anything else that would be useful

The ExtractDateTimeFeatures is a sum of the previous transformers, so it should return all possible date and time features.

The reason I suggest breaking this into 3 classes is that some timestamps contain only dates, some only times, and some both. I think it would be easier if the user, who knows the timestamp, selects the appropriate transformer, instead of us adding code to work out which type of timestamp it is and then derive the features.

To consider: should version 1 of this transformer return all possible features? Or should we give the user an option of which features to return? For example, the user may want year and month but not quarter and semester.

Example code in recipes 2 to 5 of this link.

Things to think about for the transformers design:

  • The transformer returns all new variables by default, or only those indicated by the user. This behavior could be regulated by a parameter in the init. As per the previous question, should we leave this for version 2 of this transformer, or add it straight away?
  • The transformer should be able to derive features from more than 1 datetime variable at a time, like all Feature-engine transformers.
  • Option in the transformer to drop the original datetime variables (as those are not useful for machine learning): drop_datetime_variables=True or similar.

Files needed in addition to code files:

  • add transformer to readme list
  • add transformer to docs/index.rst
  • add docs folder exclusive for this transformer with rst files with examples
  • add jupyter notebooks showing how to use these transformers

This issue can be done in 1 big PR, or multiple PRs, maybe 1 per transformer.
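
A hedged sketch of the kind of extraction the proposed base class could build on, using the pandas .dt accessor (the feature set is illustrative):

import pandas as pd

s = pd.to_datetime(pd.Series(["2021-03-29 09:15", "2021-12-31 18:40"]))

date_time_features = pd.DataFrame({
    "month": s.dt.month,
    "quarter": s.dt.quarter,
    "semester": (s.dt.quarter > 2).astype(int) + 1,
    "year": s.dt.year,
    "week": s.dt.isocalendar().week,
    "day_of_week": s.dt.dayofweek,                    # 0-6
    "is_weekend": (s.dt.dayofweek > 4).astype(int),   # binary
    "hour": s.dt.hour,
    "minute": s.dt.minute,
    "second": s.dt.second,
})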

add style checks

Expand the flake8 style checks to the entire package. At the moment only base_transformers.py is checked. Make use of pre-commit.

Proposal: monotonic and user defined binning

Hi, first of all thanks for the great package and all the efforts put into it.
I would like to propose additional algorithms which would in my opinion add value to the package.

  1. Monotonic WoE binning. This one could be pretty useful for logistic regression problems, particularly (but not limited to) in risk scoring and fraud detection models.

  2. User defined binning, where you specify bin boundaries manually.

Basically, the WoE binning algorithm has quite a few Python implementations already, and user-defined binning is even more straightforward, being just a matter of a couple of lines of code. However, I believe it would be beneficial to have both algorithms as part of feature_engine, and especially useful to have them in the form of standard transformers (in sklearn style), which allows saving and retrieving them as preprocessing models using the pickle or joblib libraries.

fix docstrings inheritance

Docstrings in parent classes are not inherited by sphinx, and are therefore not shown in the documentation. We need a workaround to fix this.

Warning message

Whenever I use one of your transform functions on a training set...

X_train_enc = ohe_enc.transform(X_train)

I get a warning message...

versions/3.7.4/lib/python3.7/site-packages/sklearn/utils/validation.py:933: FutureWarning: Passing attributes to check_is_fitted is deprecated and will be removed in 0.23. The attributes argument is ignored.
"argument is ignored.", FutureWarning)

Using a test set doesn't produce the warning.

I'm using version 0.22 of scikit-learn.

Thank you

improve categorical encoding notebooks

Separate each categorical encoder into a single notebook, within the encoders folder.

Check if new parameters or functionality have been added and include examples in the notebooks.

The formula to calculate WoE in WoERatioCategoricalEncoder may not be correct:

Hello,
Thank you for creating this package. It covers almost all the hot topics in feature engineering.
One thing I noticed is that the formula used to calculate the WoE in WoERatioCategoricalEncoder may not be correct. The percentage of goods should be the count of goods in that category divided by the total count of goods in the sample space, not the count of goods divided by the total count of observations in that category. The same applies to the percentage of bads.
For your example in the doc of WoERatioCategoricalEncoder,
the WoE for the 'cabin' variable in the titanic dataset should be:
{'cabin': {'B': 1.629962,
           'C': 0.721704,
           'D': 1.405081,
           'E': 1.405081,
           'Rare': 0.738745,
           'n': -0.357528}}
The calculation is below:
cabin   n_obs   prop_[1]   prop_n_obs   n_[0]   n_[1]   prop_n_[0]   prop_n_[1]   WoE_correct   WoE_in_Doc
n         702   0.304843     0.766376     488     214     0.866785     0.606232     -0.357528    -0.824339
C          71   0.563380     0.077511      31      40     0.055062     0.113314      0.721704     0.254892
Rare       37   0.567568     0.040393      16      21     0.028419     0.059490      0.738745     0.271934
D          32   0.718750     0.034934       9      23     0.015986     0.065156      1.405081     0.938270
E          32   0.718750     0.034934       9      23     0.015986     0.065156      1.405081     0.938270
B          42   0.761905     0.045852      10      32     0.017762     0.090652      1.629962     1.163151
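
A hedged sketch of the calculation described above, where the proportions of goods and bads are each taken over the whole sample (the function and column names are illustrative):

import numpy as np
import pandas as pd

def woe_per_category(X, y, variable):
    """WoE = ln( P(category | target=1) / P(category | target=0) )."""
    tmp = pd.DataFrame({variable: X[variable], "target": y})
    goods = tmp[tmp["target"] == 1].groupby(variable).size() / (tmp["target"] == 1).sum()
    bads = tmp[tmp["target"] == 0].groupby(variable).size() / (tmp["target"] == 0).sum()
    return np.log(goods / bads)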

RareLabelCategoricalEncoder checks if new data matches shape of the data it was fitted on and throws exception if they do not match

RareLabelCategoricalEncoder checks whether new data matches the shape of the data it was fitted on, and throws an exception when the number of columns does not match.

This makes using the encoder in a real setup (on unseen data) difficult.

A temporary and ugly workaround is to create dummy columns just to match the number of columns in the new data.

The desired behavior is to apply the encoder to the columns that were defined to undergo the transformation, and throw an exception only if these columns do not exist in the new data set.
