
Feature-engine



Feature-engine is a Python library with multiple transformers to engineer and select features for use in machine learning models. Feature-engine's transformers follow Scikit-learn's API, with fit() and transform() methods to learn parameters from the data and then transform it.
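
For example, a minimal sketch of the fit/transform pattern (the toy data is illustrative):

import pandas as pd
from feature_engine.imputation import MeanMedianImputer

df = pd.DataFrame({"age": [20, 30, None, 40]})

# fit() learns the median of "age"; transform() fills the missing value with it
imputer = MeanMedianImputer(imputation_method="median", variables=["age"])
imputer.fit(df)
df_t = imputer.transform(df)  # the NaN becomes 30.0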

Feature-engine features in the following resources

Blogs about Feature-engine

Documentation

Psst! How did you find us?

We want to share Feature-engine with more people. It would help us a lot if you told us how you discovered us.

Then we'd know what we are doing right and which channels to use to share the love.

Please share your story by answering 1 quick question at this link. 😃

Feature-engine's transformers currently include functionality for:

  • Missing Data Imputation
  • Categorical Encoding
  • Discretisation
  • Outlier Capping or Removal
  • Variable Transformation
  • Variable Creation
  • Variable Selection
  • Datetime Features
  • Time Series
  • Preprocessing
  • Scikit-learn Wrappers

Imputation Methods

  • MeanMedianImputer
  • ArbitraryNumberImputer
  • RandomSampleImputer
  • EndTailImputer
  • CategoricalImputer
  • AddMissingIndicator
  • DropMissingData
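
A minimal sketch combining two of these imputers on toy data (the column names are illustrative):

import pandas as pd
from feature_engine.imputation import AddMissingIndicator, CategoricalImputer

df = pd.DataFrame({"city": ["London", None, "Paris"], "size": [1.0, None, 3.0]})

# flag where "size" was missing, then fill missing categories with "Missing"
df = AddMissingIndicator(variables=["size"]).fit_transform(df)
df = CategoricalImputer(variables=["city"]).fit_transform(df)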

Encoding Methods

  • OneHotEncoder
  • OrdinalEncoder
  • CountFrequencyEncoder
  • MeanEncoder
  • WoEEncoder
  • RareLabelEncoder
  • DecisionTreeEncoder
  • StringSimilarityEncoder
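
A minimal sketch with one of these encoders (toy data; the API follows the same fit/transform pattern shown above):

import pandas as pd
from feature_engine.encoding import CountFrequencyEncoder

df = pd.DataFrame({"colour": ["blue", "blue", "green", "red"]})

# replace each category with its count in the training data
encoder = CountFrequencyEncoder(encoding_method="count", variables=["colour"])
df_t = encoder.fit_transform(df)  # blue -> 2, green -> 1, red -> 1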

Discretisation methods

  • EqualFrequencyDiscretiser
  • EqualWidthDiscretiser
  • GeometricWidthDiscretiser
  • DecisionTreeDiscretiser
  • ArbitraryDiscretiser
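
A minimal sketch with the equal-frequency discretiser (the toy data is illustrative):

import numpy as np
import pandas as pd
from feature_engine.discretisation import EqualFrequencyDiscretiser

df = pd.DataFrame({"income": np.random.exponential(30000, size=100)})

# sort values into 4 quantile-based bins, returned as the integers 0-3
disc = EqualFrequencyDiscretiser(q=4, variables=["income"])
df_t = disc.fit_transform(df)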

Outlier Handling methods

  • Winsorizer
  • ArbitraryOutlierCapper
  • OutlierTrimmer
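
A minimal sketch with the Winsorizer (the toy data is illustrative):

import numpy as np
import pandas as pd
from feature_engine.outliers import Winsorizer

df = pd.DataFrame({"fare": np.append(np.random.normal(30, 5, 99), 500.0)})

# cap values above mean + 3 standard deviations, learned during fit()
capper = Winsorizer(capping_method="gaussian", tail="right", fold=3, variables=["fare"])
df_t = capper.fit_transform(df)  # the 500 is pulled down to the learned boundary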

Variable Transformation methods

  • LogTransformer
  • LogCpTransformer
  • ReciprocalTransformer
  • ArcsinTransformer
  • PowerTransformer
  • BoxCoxTransformer
  • YeoJohnsonTransformer
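
A minimal sketch with the log transformer (toy data is illustrative; LogTransformer raises an error if any value is <= 0):

import numpy as np
import pandas as pd
from feature_engine.transformation import LogTransformer

df = pd.DataFrame({"income": np.random.lognormal(mean=10, sigma=1, size=100)})

# apply the natural logarithm to the selected variables
df_t = LogTransformer(variables=["income"]).fit_transform(df)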

Variable Creation:

  • MathFeatures
  • RelativeFeatures
  • CyclicalFeatures
  • DecisionTreeFeatures
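
A minimal sketch with CyclicalFeatures (the toy data is illustrative):

import pandas as pd
from feature_engine.creation import CyclicalFeatures

df = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})

# add hour_sin and hour_cos, so that hour 23 lands close to hour 0
df_t = CyclicalFeatures(variables=["hour"]).fit_transform(df)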

Feature Selection:

  • DropFeatures
  • DropConstantFeatures
  • DropDuplicateFeatures
  • DropCorrelatedFeatures
  • SmartCorrelatedSelection
  • SelectByShuffling
  • SelectBySingleFeaturePerformance
  • SelectByTargetMeanPerformance
  • RecursiveFeatureElimination
  • RecursiveFeatureAddition
  • DropHighPSIFeatures
  • SelectByInformationValue
  • ProbeFeatureSelection
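
A minimal sketch chaining two of the simpler selectors (the toy data is illustrative):

import pandas as pd
from feature_engine.selection import DropConstantFeatures, DropDuplicateFeatures

df = pd.DataFrame({
    "const": [1, 1, 1, 1],       # constant, so dropped
    "x": [1, 2, 3, 4],
    "x_copy": [1, 2, 3, 4],      # duplicate of x, so dropped
})

df_t = DropConstantFeatures().fit_transform(df)
df_t = DropDuplicateFeatures().fit_transform(df_t)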

Datetime

  • DatetimeFeatures
  • DatetimeSubtraction
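
A minimal sketch with DatetimeFeatures (the toy data is illustrative):

import pandas as pd
from feature_engine.datetime import DatetimeFeatures

df = pd.DataFrame({"signup": pd.to_datetime(["2023-01-15", "2023-06-01"])})

# extract a chosen subset of features; the original column is dropped by default
dtf = DatetimeFeatures(variables=["signup"], features_to_extract=["month", "day_of_week"])
df_t = dtf.fit_transform(df)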

Time Series

  • LagFeatures
  • WindowFeatures
  • ExpandingWindowFeatures
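
A minimal sketch with LagFeatures (the toy data is illustrative):

import pandas as pd
from feature_engine.timeseries.forecasting import LagFeatures

df = pd.DataFrame(
    {"sales": [10, 12, 11, 13]},
    index=pd.date_range("2023-01-01", periods=4, freq="D"),
)

# add sales_lag_1 and sales_lag_2, shifted by 1 and 2 periods
df_t = LagFeatures(variables=["sales"], periods=[1, 2]).fit_transform(df)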

Pipelines

  • Pipeline
  • make_pipeline
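
A minimal sketch, assuming the Pipeline and make_pipeline exposed under feature_engine.pipeline in recent releases (they mirror scikit-learn's, but also support transformers that add or drop rows):

from feature_engine.encoding import OrdinalEncoder
from feature_engine.imputation import MeanMedianImputer
from feature_engine.pipeline import make_pipeline

pipe = make_pipeline(
    MeanMedianImputer(imputation_method="median"),
    OrdinalEncoder(encoding_method="arbitrary"),
)
# pipe.fit(X_train) then pipe.transform(X_test), as with any scikit-learn pipeline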

Preprocessing

  • MatchCategories
  • MatchVariables

Wrappers:

  • SklearnTransformerWrapper
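
A minimal sketch applying a scikit-learn transformer to selected columns only, while keeping the dataframe format (the toy data is illustrative):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0], "c": ["x", "y", "z"]})

# scale only columns "a" and "b"; "c" passes through untouched
scaler = SklearnTransformerWrapper(transformer=StandardScaler(), variables=["a", "b"])
df_t = scaler.fit_transform(df)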

Installation

From PyPI using pip:

pip install feature_engine

From Anaconda:

conda install -c conda-forge feature_engine

Or simply clone it:

git clone https://github.com/feature-engine/feature_engine.git

Example Usage

>>> import pandas as pd
>>> from feature_engine.encoding import RareLabelEncoder

>>> data = {'var_A': ['A'] * 10 + ['B'] * 10 + ['C'] * 2 + ['D'] * 1}
>>> data = pd.DataFrame(data)
>>> data['var_A'].value_counts()
A    10
B    10
C     2
D     1
Name: var_A, dtype: int64
>>> rare_encoder = RareLabelEncoder(tol=0.10, n_categories=3)
>>> data_encoded = rare_encoder.fit_transform(data)
>>> data_encoded['var_A'].value_counts()
A       10
B       10
Rare     3
Name: var_A, dtype: int64

Find more examples in our Jupyter Notebook Gallery or in the documentation.

Contribute

Details about how to contribute can be found in the Contribute page.

Briefly:

  • Fork the repo
  • Clone your fork into your local computer:
git clone https://github.com/<YOURUSERNAME>/feature_engine.git
  • Navigate into the repo folder:
cd feature_engine
  • Install Feature-engine as a developer:
pip install -e .
  • Optional: Create and activate a virtual environment with any tool of choice
  • Install Feature-engine dependencies:
pip install -r requirements.txt

and

pip install -r test_requirements.txt
  • Create a feature branch with a meaningful name for your feature:
git checkout -b myfeaturebranch
  • Develop your feature, tests and documentation
  • Make sure the tests pass
  • Make a PR

Thank you!!

Documentation

Feature-engine documentation is built using Sphinx and is hosted on Read the Docs.

To build the documentation, make sure you have the dependencies installed. From the root directory, run:

pip install -r docs/requirements.txt

Now you can build the docs using:

sphinx-build -b html docs build

License

The content of this repository is licensed under a BSD 3-Clause license.

Sponsor us

Sponsor us and support our mission to democratize machine learning and programming tools through open-source software.


feature_engine's Issues

SklearnTransformerWrapper error when not defining variables

Describe the bug
SklearnTransformerWrapper raises an error when variables is not defined (I'd like to transform all the features). When any variables are defined, it works.

To Reproduce
I demonstrate it with StandardScaler, as in the sketch below.
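
A minimal reconstruction of the reported scenario (the exact code from the original screenshot is not available):

import pandas as pd
from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper

X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# reportedly raised an error in feature-engine 0.5.15 when variables was omitted
scaler = SklearnTransformerWrapper(transformer=StandardScaler())
X_t = scaler.fit_transform(X)

# whereas passing the columns explicitly worked
scaler = SklearnTransformerWrapper(transformer=StandardScaler(), variables=["a", "b"])
X_t = scaler.fit_transform(X)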

Desktop (please complete the following information):

  • OS: Windows 8.1
  • Browser: Chrome
  • Version: feature-engine 0.5.15, scikit-learn 0.23.1, pandas 1.0.5

Kindly help. Thanks.

improve discretisers jupyter demos

Separate each discretiser into a single jupyter notebook.

Check if new functionality has been added and add an example in the notebook.

add contribute page

Add contribute page to welcome contributors and explain how the contribution process works.

Discretiser - return interval boundaries

I would like to suggest that the discretisers have an option to return the interval boundaries instead of an integer; it makes the variable's output more understandable.

For example: the outputs 1, 3, 2 become (0,10], (20,30], (10,20]

Thanks in advance

Return 0 for unseen labels of a Categorical/Nominal Variable in Test Set or in Production

The two questions below can be answered with this feature request.
Question 1: In case a categorical variable in the train set does not have a 'Rare' label, based on the values of the parameters 'tol' and 'n_categories', how do we handle unseen values in the test set or in production?

Question 2: In the absence of a 'Rare' label in a categorical variable in the train set, based on the values of the parameters 'tol' and 'n_categories', is there any mechanism in Feature-engine like handle_unknown='ignore', as used by sklearn's OneHotEncoder, for any of the categorical encoding methods implemented in Feature-engine?

CategoricalVariableImputer

Allow the user to select the string with which to replace missing data. At the moment, the only value allowed is the default: 'missing'.

Creation of new feature instead of overwriting the feature

First, thanks for the package; it simplifies the overall process. My issue: when using the count frequency encoder, why am I not able to create a new feature instead of overwriting the categorical variable? It would be useful if I wanted to use another categorical encoder function afterwards.
Correct me if I am wrong and this feature already exists in the package.

Code smell

Massive PR:
Separate modules with > 500 lines of code into relevant modules, all as part of a folder.

Example: a folder for the missing data imputers, containing multiple scripts, each holding transformers with similar functionality, or one transformer at a time.

Draft of utility functions

Hi, as stated in the course, here is a draft of the functions used several times in the course (I tried them and they seem to work fine, but I haven't tested them completely):

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# for Q-Q plots
import scipy.stats as stats



def diagnostic_boxplots(df, variable, figsize=None):
    '''Plot a histogram, Q-Q plot and boxplot for the variable of interest
    in the dataframe (df). figsize is an optional tuple, default (16, 4).'''

    # define figure size
    if figsize is None:
        figsize = (16, 4)
    plt.figure(figsize=figsize)

    # histogram
    plt.subplot(1, 3, 1)
    sns.histplot(df[variable], bins=30, kde=True)  # distplot is deprecated in recent seaborn
    plt.title('Histogram')

    # Q-Q plot
    plt.subplot(1, 3, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.ylabel('Variable quantiles')

    # boxplot
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df[variable])
    plt.title('Boxplot')

    plt.show()



# function to find upper and lower boundaries
# for normally distributed variables


def find_normal_boundaries(df, variable, distance=3):

    ''' calculate the boundaries outside which sit the outliers
     for a Gaussian distribution given the distance, default 3 '''

    upper_boundary = df[variable].mean() + distance * df[variable].std()
    lower_boundary = df[variable].mean() - distance * df[variable].std()

    return upper_boundary, lower_boundary


# function to find upper and lower boundaries
# for skewed distributed variables


def find_skewed_boundaries(df, variable, distance):

    ''' Let's calculate the boundaries outside which sit the outliers
     for skewed distributions

     distance passed as an argument, gives us the option to
     estimate 1.5 times or 3 times the IQR to calculate
     the boundaries.'''

    IQR = df[variable].quantile(0.75) - df[variable].quantile(0.25)

    lower_boundary = df[variable].quantile(0.25) - (IQR * distance)
    upper_boundary = df[variable].quantile(0.75) + (IQR * distance)

    return upper_boundary, lower_boundary


def diagnostic_plots(df, variable, figsize=None):
    '''Plot a histogram and a Q-Q plot side by side for a given variable,
    optionally with figsize as a tuple, default (15, 6).'''

    if figsize is None:
        figsize = (15, 6)
    plt.figure(figsize=figsize)
    plt.subplot(1, 2, 1)
    df[variable].hist(bins=30)

    plt.subplot(1, 2, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)

    plt.show()


def set_boundaries(df, variable, upper_limit=None, lower_limit=None):
    '''Set the boundaries, one or both, for the specified variable in the
    DataFrame (df).'''

    if upper_limit is None and lower_limit is None:
        return df[variable]
    elif lower_limit is None:
        return np.where(df[variable] > upper_limit, upper_limit, df[variable])
    elif upper_limit is None:
        return np.where(df[variable] < lower_limit, lower_limit, df[variable])
    else:
        return np.where(df[variable] > upper_limit, upper_limit,
                        np.where(df[variable] < lower_limit, lower_limit, df[variable]))


def flag_boundaries(df, variable, upper_limit=None, lower_limit=None):
    '''Flag values beyond the boundaries, one or both, for the specified
    variable in the DataFrame (df).'''

    if upper_limit is None and lower_limit is None:
        return False
    elif lower_limit is None:
        return np.where(df[variable] > upper_limit, True, False)
    elif upper_limit is None:
        return np.where(df[variable] < lower_limit, True, False)
    else:
        return np.where(df[variable] > upper_limit, True,
                        np.where(df[variable] < lower_limit, True, False))




def find_quantile_boundaries(df, variable, lower_quantile=0.05, upper_quantile=0.95):

    '''Return the boundaries at the given quantiles, default 0.05 and 0.95.'''

    lower_boundary = df[variable].quantile(lower_quantile)
    upper_boundary = df[variable].quantile(upper_quantile)

    return upper_boundary, lower_boundary

Nothing new but they look convenient to me.

Create new features from all combinations of categorical features

Is your feature request related to a problem? Please describe.
If we have categorical features, how do we create new features from all of their combinatoric combinations? In real life categorical features are NOT independent; many of them depend on each other.

Even scikit-learn cannot do this, but maybe you will?

related to
PacktPublishing/Python-Feature-Engineering-Cookbook#1
Describe the solution you'd like
For example, a maximum number of combined features is given: 2, 4 or 5.

For a pandas DataFrame you can use concatenation:
https://stackoverflow.com/questions/19377969/combine-two-columns-of-text-in-dataframe-in-pandas-python

columns = ['whatever', 'columns', 'you', 'choose']
df['period'] = df[columns].astype(str).sum(axis=1)

For three-feature combinations from 11 features, you get 165 new features from all combinations (not permutations). Three nested loops do not seem a good way to do this (see the sketch after this section):

for i in range(1, 11):
    for j in range(i + 1, 11):
        for k in range(j + 1, 11):

Then you get many new features.

"Another alternative that I've seen from some Kaggle masters is to join the categories of 2 different variables into a new categorical variable. So, for example, if you have the variable gender, with the values female and male for observations 1 and 2, and the variable colour, with the values blue and green for observations 1 and 2 respectively, you could create a third categorical variable called gender-colour, with the value female-blue for observation 1 and male-green for observation 2. Then you would apply the encoding methods from section 3 to this new variable."

Yes, do this, but it should not necessarily require pandas. Also think about RAM use, since there will be a lot of new features: before creating them, consider converting the categorical features to "int" types with a small number of digits from numpy.
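
A hedged sketch of the requested behavior, using itertools.combinations instead of nested loops (the function name and defaults are illustrative, not part of feature_engine):

from itertools import combinations

import pandas as pd

def combine_categoricals(df, columns, max_order=3, sep="-"):
    """Concatenate every combination of 2 to max_order of the given columns."""
    df = df.copy()
    for r in range(2, max_order + 1):
        for cols in combinations(columns, r):
            df[sep.join(cols)] = df[list(cols)].astype(str).agg(sep.join, axis=1)
    return df

df = pd.DataFrame({"gender": ["female", "male"], "colour": ["blue", "green"]})
df = combine_categoricals(df, ["gender", "colour"], max_order=2)
# df["gender-colour"] -> ["female-blue", "male-green"]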

improve outlier jupyter demos

Demonstrate each class in a single jupyter notebook (at the moment all classes are together in one notebook).

Check if new functionality has been added, and include a demonstration in the notebook.

transform the feature back to original form

When we use target-guided mean encoding, we replace the labels with their mean and fit the algorithm on top of it. So if we then create a decision tree, it shows the splits of the features using the mean values.
It would be better if there were an inverse transform function in Feature-engine which could restore the original values of the dataframe.

Requesting a new feature addition in the Feature-engine library to include the inverse transform.

improve imputation notebooks

Separate each missing data imputer into a single notebook, within the imputer folder. Check if new parameters or changes have been done to the imputers, and try to include a demo in the notebook.

Allow outlier removers to operate on data with missing values

Is your feature request related to a problem? Please describe.
A typical way of imputing numerical data is by replacing missing values with the mean. However, if the data has outliers, this approach can be ineffective. But the outlier remover classes in Feature-engine cannot be fit on data that has missing values, meaning that imputation must happen first. It would be great if these classes were compatible with missing data, so that outliers could be capped before imputation occurs.

Describe the solution you'd like
Allow the outlier remover classes (e.g. Winsorizer) to fit and transform data with missing values. The missing values can be ignored in both cases.

Allow user defined value to replace rare categories

Great package! Thanks for all the work put into it.

Just a small point: in the RareLabelCategoricalEncoder, I wondered if it might be nice to allow the user to define the value with which to replace rare categories (but still default to 'Rare'). You could imagine someone wanting all the rare categories subsumed into an existing larger category, for example.

add / update contributing docs

The contributing page does not reflect the way in which we are actually working. So we need to re-write it. At the moment we are working with feature branches for each issue and pushing directly to master.

We need to make clear that for code changes, we need to bump the version at each PR, so the new version is automatically uploaded to PyPI.

We need to define if we need changes in the version for changes in the docs. I don't think we do, but need to read more about best practices.

code review

Have the existing code reviewed by an experienced Python developer.

add test to determine if WoERatioCategoricalEncoder returns an error when the probability in the denominator is 0

At the moment the test is commented out:
https://github.com/solegalli/feature_engine/blob/master/tests/test_encoding/test_woe_encoder.py#L112

the aim is to test this bit of code here

I think the test should work; I commented it out because I changed the error for a warning. But now we have decided to go back to returning an error.

So, in short, the aim is to corroborate that the commented-out test checks the intended bit of code; if yes, uncomment it, otherwise replace it with a suitable test.

include transformer for datetime variables

The first version of this module should include the following:

  • New module: datetime (folder)

Three transformers:

  • ExtractDateFeatures
  • ExtractTimeFeatures
  • ExtractDateTimeFeatures

A base class:

  • DateTimeBaseTransformer

The base class should:

  • check that the variables entered by the user are datetime, or if string / object, transform them to datetime (I wonder if we should make this a function in variable_manipulation and call it in this transformer? If we develop the timeseries module, we may need it for other modules as well)

  • have a method to return time features

  • have a method to return date features

  • methods to check the input dataframe, and options on what to do if outliers are present

The ExtractDateFeatures should derive the following features:

  • month, quarter, semester and year (all numeric outputs)
  • week of the year
  • is week of the month supported by pandas? if yes, then we should return it
  • day (numeric 1-31), day of the week (numeric 0-6), and is_weekend (binary)
  • anything else supported by pandas?
  • anything else that would be useful

The ExtractTimeFeatures should extract the following features

  • hour, minute, second
  • timezone: with the parameter return_timezone=False by default, we allow the user to return a time zone categorical feature
  • unify_timezone=False, to handle different timezones. If True, the transformer is timezone aware, unifies to Greenwich and then derives the features (should we give the user the option to unify to something else? probably yes)
  • to discuss: is_morning, is_afternoon, is_evening
  • to discuss: working hours (I am thinking of passing a string parameter like '9-17' and using it to determine this feature)
  • anything else that would be useful

The ExtractDateTimeFeatures is a sum of the previous transformers, so it should return all possible date and time features.

The reason I suggest breaking this into 3 classes is that some timestamps contain only dates, some only times, and some both. I think it would be easier if the user, who knows the timestamp, selects the appropriate transformer, instead of us adding code to work out which type of timestamp it is and then derive the features.

To consider: should version 1 of this transformer return all possible features? Or should we give the user an option of which features to return? For example, the user may want year and month but not quarter and semester.

Example code in recipes 2 to 5 of this link.

Things to think about for the transformers design:

  • The transformer returns all new variables by default, or only those indicated by the user. This behavior could be regulated by a parameter in the init. As per the previous question, should we leave this for version 2 of this transformer, or add it straight away?
  • The transformer should be able to derive features from more than 1 datetime variable at a time, like all Feature-engine transformers.
  • Option in the transformer to drop the original datetime variables (as those are not useful for machine learning): drop_datetime_variables=True or similar.

Files needed in addition to code files:

  • add transformer to readme list
  • add transformer to docs/index.rst
  • add docs folder exclusive for this transformer with rst files with examples
  • add jupyter notebooks showing how to use these transformers

This issue can be done in 1 big PR, or multiple PRs, maybe 1 per transformer.
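
A hedged sketch of the kind of extraction the proposed base class could build on, using the pandas .dt accessor (the feature set is illustrative):

import pandas as pd

s = pd.to_datetime(pd.Series(["2021-03-29 09:15", "2021-12-31 18:40"]))

date_time_features = pd.DataFrame({
    "month": s.dt.month,
    "quarter": s.dt.quarter,
    "semester": (s.dt.quarter > 2).astype(int) + 1,
    "year": s.dt.year,
    "week": s.dt.isocalendar().week,
    "day_of_week": s.dt.dayofweek,                    # 0-6
    "is_weekend": (s.dt.dayofweek > 4).astype(int),   # binary
    "hour": s.dt.hour,
    "minute": s.dt.minute,
    "second": s.dt.second,
})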

add style checks

Expand the flake8 style checks to the entire package. At the moment only base_transformers.py is checked. Make use of pre-commit.

Proposal: monotonic and user defined binning

Hi, first of all thanks for the great package and all the efforts put into it.
I would like to propose additional algorithms which would in my opinion add value to the package.

  1. Monotonic WoE binning. This one could be pretty useful for logistic regression problems, particularly (but not limited to) in risk scoring and fraud detection models.

  2. User defined binning, where you specify bin boundaries manually.

Basically, the WoE binning algorithm has quite a few Python implementations already, and user-defined binning is even more straightforward, being just a matter of a couple of lines of code. However, I believe it would be beneficial to have both algorithms as part of feature_engine, and especially useful to have them in the form of standard transformers (in sklearn style), which allows saving and retrieving them as preprocessing models using the pickle or joblib libraries.

fix docstrings inheritance

Docstrings in parent classes are not inherited by sphinx, and are therefore not shown in the documentation. We need a workaround to fix this.

Warning message

Whenever I use one of your transform functions on a training set...

X_train_enc = ohe_enc.transform(X_train)

I get a warning message...

versions/3.7.4/lib/python3.7/site-packages/sklearn/utils/validation.py:933: FutureWarning: Passing attributes to check_is_fitted is deprecated and will be removed in 0.23. The attributes argument is ignored.
"argument is ignored.", FutureWarning)

Using a test set doesn't produce the warning.

I'm using version 0.22 of scikit-learn.

Thank you

improve categorical encoding notebooks

Separate each categorical encoder into a single notebook, within the encoders folder.

Check if new parameters or functionality have been added and include examples in the notebooks.

The formula to calculate WoE in WoERatioCategoricalEncoder may not be correct:

Hello,
Thank you for creating this package. It covers almost all the hot topics in feature engineering.
One thing I noticed is that the formula used to calculate the WoE in WoERatioCategoricalEncoder may not be correct. The percentage of goods should be the count of goods in that category divided by the total count of goods in the sample space, not the count of goods divided by the total count of observations in that category. The same applies to the percentage of bads.
For your example in the doc of WoERatioCategoricalEncoder,
the WoE for the 'cabin' variable in the titanic dataset should be:
{'cabin': {'B': 1.629962,
           'C': 0.721704,
           'D': 1.405081,
           'E': 1.405081,
           'Rare': 0.738745,
           'n': -0.357528}}
The calculation is below:
cabin   n_obs   prop_[1]   prop_n_obs   n_[0]   n_[1]   prop_n_[0]   prop_n_[1]   WoE_correct   WoE_in_Doc
n         702   0.304843     0.766376     488     214     0.866785     0.606232     -0.357528    -0.824339
C          71   0.563380     0.077511      31      40     0.055062     0.113314      0.721704     0.254892
Rare       37   0.567568     0.040393      16      21     0.028419     0.059490      0.738745     0.271934
D          32   0.718750     0.034934       9      23     0.015986     0.065156      1.405081     0.938270
E          32   0.718750     0.034934       9      23     0.015986     0.065156      1.405081     0.938270
B          42   0.761905     0.045852      10      32     0.017762     0.090652      1.629962     1.163151
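
A hedged sketch of the calculation described above, where the proportions of goods and bads are each taken over the whole sample (the function and column names are illustrative):

import numpy as np
import pandas as pd

def woe_per_category(X, y, variable):
    """WoE = ln( P(category | target=1) / P(category | target=0) )."""
    tmp = pd.DataFrame({variable: X[variable], "target": y})
    goods = tmp[tmp["target"] == 1].groupby(variable).size() / (tmp["target"] == 1).sum()
    bads = tmp[tmp["target"] == 0].groupby(variable).size() / (tmp["target"] == 0).sum()
    return np.log(goods / bads)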

RareLabelCategoricalEncoder checks if new data matches shape of the data it was fitted on and throws exception if they do not match

RareLabelCategoricalEncoder checks whether new data matches the shape of the data it was fitted on, and throws an exception when the number of columns does not match.

This makes using the encoder in a real setup (on unseen data) difficult.

A temporary and ugly workaround is to create dummy columns just to match the number of columns in the new data.

The desired behavior is to apply the encoder to the columns that were defined to undergo the transformation, and throw an exception only if these columns do not exist in the new data set.
