kaggle-notebooks's People

Contributors

kylegilde

kaggle-notebooks's Issues

Allowing Feature Selection inside or before Column Transformer

I came across another problem, maybe related to #1, but I'm not sure whether this is by design or not.

In a slightly different example from the one before, I created a toy dataframe with one column ("c") that contains only null values. I want this column to be dropped inside the ColumnTransformer pipeline before imputing (an all-NaN column will be silently dropped by SimpleImputer, so in my opinion it is better to have a step that drops it explicitly). So the code below:
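For reference, the silent drop can be reproduced in isolation; here is a minimal sketch showing SimpleImputer discarding a feature that has no observed values (it only emits a warning):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, np.nan],
              [np.nan, np.nan]])  # second column is all-NaN

# SimpleImputer warns "Skipping features without any observed values"
# and silently drops the all-NaN column from the output
out = SimpleImputer(strategy='mean').fit_transform(X)
print(out.shape)  # (3, 1) -- one column fewer than the input
```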

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

import feature_importance as fi

data = {
        'a': [123, 145, 100, np.NaN, np.NaN, 150],
        'b': [10, np.NaN, 30, np.NaN, np.NaN, 20],
        'c': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN],
        'd': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
        'e': [np.NaN, 'GE', 'US', 'GE', np.NaN, 'UK']
        }
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'])

numerical_features = ['a', 'b', 'c']
categorical_features = ['e']
drop_features = ['d']

# Deal with numerical columns
numerical_transformer = Pipeline(steps=[
    ('remnulls', VarianceThreshold(threshold=0.0)),  # col c has only NaNs and should be dropped
    ('imputer', SimpleImputer(strategy='mean')),
])

# Deal with categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Put together the column transformer
column_transformer = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),  # remove all-nan cols before imputing
        ('cat', categorical_transformer, categorical_features), # impute + one hot encoding
        ('dropme', 'drop', drop_features),  # drop column d
    ])

# Make a complete preprocessing pipeline
preproc = Pipeline(steps=[
    ('column_transformer', column_transformer),
])

# Fit the preprocessor
fitted_pp = preproc.fit(df)

# Transform the dataset
transf_df = fitted_pp.transform(df)
print('Shape old dataframe: {}'.format(str(df.shape)))
print(df)

# Use class FeatureImportance to get the names of the new features
feature_importance = fi.FeatureImportance(fitted_pp)
new_cols = feature_importance.get_selected_features()

print('Shape new dataframe: {}'.format(str(transf_df.shape)))
print(transf_df)
print('Cols in new df according to FeatureImportance: ({}) {}'.format(len(new_cols), new_cols))

returns:

Shape old dataframe: (6, 5)
       a     b   c        d    e
0  123.0  10.0 NaN  Michael  NaN
1  145.0   NaN NaN  Jessica   GE
2  100.0  30.0 NaN      Sue   US
3    NaN   NaN NaN     Jake   GE
4    NaN   NaN NaN      Amy  NaN
5  150.0  20.0 NaN      Tye   UK
Shape new dataframe: (6, 6)
[[123.   10.    0.    0.    0.    1. ]
 [145.   20.    1.    0.    0.    0. ]
 [100.   30.    0.    0.    1.    0. ]
 [129.5  20.    1.    0.    0.    0. ]
 [129.5  20.    0.    0.    0.    1. ]
 [150.   20.    0.    1.    0.    0. ]]
Cols in new df according to FeatureImportance: (7) ['a', 'b', 'c', 'e_GE', 'e_UK', 'e_US', 'e_missing']

So you can see that column c was dropped from the resulting dataframe, but it still shows up in the list of features.

So, my question is: is there a way to have a Pipeline with a feature-selection step inside a ColumnTransformer, or at least as a step before the ColumnTransformer in the outer Pipeline, to avoid the Imputer silently dropping all-NaN columns?
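One workaround I've been experimenting with (a sketch only, not tested against the FeatureImportance class) is to drop all-NaN columns up front with a FunctionTransformer, before the ColumnTransformer ever sees them. The step name `drop_all_nan` is just my own label:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan],
    'c': [np.nan, np.nan, np.nan],  # all-NaN column
})

# drop columns that contain only NaNs before any imputation happens
drop_all_nan = FunctionTransformer(lambda X: X.dropna(axis=1, how='all'))

preproc = Pipeline(steps=[
    ('drop_all_nan', drop_all_nan),
    ('imputer', SimpleImputer(strategy='mean')),
])
out = preproc.fit_transform(df)
print(out.shape)  # (3, 1) -- column 'c' removed before imputing
```

One caveat: the lambda re-evaluates which columns are all-NaN on every call, so train and test data could end up with different column sets; a robust version would record the kept columns at fit time in a custom transformer.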

Thanks!

Problem with 'drop' column transformer

(Not sure if this is the correct place to make comments on your code; if not, please accept my apologies!)
Thank you for this great code!
I'm trying to use it, and I ran across an issue with a column transformer that uses the 'drop' option. For example:


import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

import feature_importance as fi

data = {'a': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
        'b': [np.NaN, 'F', np.NaN, 'F', np.NaN, 'M'],
        'c': [123, 145, 100, np.NaN, np.NaN, 150],
        'd': [10, np.NaN, 30, np.NaN, np.NaN, 20],
        'e': [14, np.NaN, 29, np.NaN, 52, 45],
        }
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'])

numerical_features = ['c', 'd', 'e']
categorical_features = ['a', 'b']

# Deal with categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Put together the column transformer
column_transformer = ColumnTransformer(
    transformers=[
        ('num', 'drop', numerical_features),  # drop numerical columns
        ('cat', categorical_transformer, categorical_features),
    ])

# Make a complete preprocessing pipeline
preproc = Pipeline(steps=[
    ('column_transformer', column_transformer),
    ('variance_threshold', VarianceThreshold(threshold=0.0)),
])

# Fit the preprocessor
fitted_pp = preproc.fit(df)

# Transform the dataset
transf_df = fitted_pp.transform(df)
print('Shape new dataframe: {}'.format(str(transf_df.shape)))

# Use class FeatureImportance to get the names of the new features
feature_importance = fi.FeatureImportance(fitted_pp)

new_cols = feature_importance.get_feature_names()
print('Length new cols: {}'.format(len(new_cols)))
print('New cols: {}'.format(new_cols))

Currently, the above code will output the following:

Shape new dataframe: (6, 9)
Length new cols: 12
New cols: ['c', 'd', 'e', 'a_Amy', 'a_Jake', 'a_Jessica', 'a_Michael', 'a_Sue', 'a_Tye', 'b_F', 'b_M', 'b_missing']

But columns c, d, and e (the numerical_features) should have been dropped.

If I alter line 91 (https://github.com/kylegilde/Kaggle-Notebooks/blob/master/Extracting-and-Plotting-Scikit-Feature-Names-and-Importances/feature_importance.py#L91) from:

if transformer_name == 'remainder' and transformer == 'drop':

to:

if transformer == 'drop':

then I get the expected result:

Shape new dataframe: (6, 9)
Length new cols: 9
New cols: ['a_Amy', 'a_Jake', 'a_Jessica', 'a_Michael', 'a_Sue', 'a_Tye', 'b_F', 'b_M', 'b_missing']

I'm not sure how this change affects other functionality of the class, though.
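For what it's worth, after fitting, ColumnTransformer keeps `'drop'` entries as the literal string `'drop'` in its `transformers_` attribute, for named transformers as well as for the remainder, which is why the relaxed check catches both cases. A minimal sketch:

```python
import numpy as np
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(transformers=[
    ('num', 'drop', [0]),          # a *named* 'drop' transformer
    ('cat', 'passthrough', [1]),
])
ct.fit(np.array([[1.0, 2.0], [3.0, 4.0]]))

# transformers_ contains ('num', 'drop', [0]) plus a remainder entry;
# the 'drop' entries stay as the plain string 'drop'
for name, transformer, cols in ct.transformers_:
    print(name, transformer, cols)
```

So testing `transformer == 'drop'` instead of additionally requiring `transformer_name == 'remainder'` should be the more general condition.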
