kylegilde / kaggle-notebooks
My notebooks created on Kaggle
I came across another problem, maybe related to #1, but I'm not sure whether this is by design or not.
In a slightly different example from the one before, I created a toy dataframe with one column ("c") that has only null values. I want this column to be dropped inside the ColumnTransformer pipeline before imputing, because an all-NaN column will be silently dropped by SimpleImputer, and in my opinion it is better to have a step that does this explicitly. So the code below:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import feature_importance as fi
data = {
    'a': [123, 145, 100, np.nan, np.nan, 150],
    'b': [10, np.nan, 30, np.nan, np.nan, 20],
    'c': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'd': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
    'e': [np.nan, 'GE', 'US', 'GE', np.nan, 'UK'],
}
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'])
numerical_features = ['a', 'b', 'c']
categorical_features = ['e']
drop_features = ['d']
# Deal with numerical columns
numerical_transformer = Pipeline(steps=[
    ('remnulls', VarianceThreshold(threshold=0.0)),  # col c has only NaNs and should be dropped
    ('imputer', SimpleImputer(strategy='mean')),
])
# Deal with categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
# Put together the column transformer
column_transformer = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),  # remove all-NaN cols before imputing
        ('cat', categorical_transformer, categorical_features),  # impute + one-hot encoding
        ('dropme', 'drop', drop_features),  # drop column d
    ])
# Make a complete preprocessing pipeline
preproc = Pipeline(steps=[
    ('column_transformer', column_transformer),
])
# Fit the preprocessor
fitted_pp = preproc.fit(df)
# Transform the dataset
transf_df = fitted_pp.transform(df)
print('Shape old dataframe: {}'.format(str(df.shape)))
print(df)
# Use class FeatureImportance to get the names of the new features
feature_importance = fi.FeatureImportance(fitted_pp)
new_cols = feature_importance.get_selected_features()
print('Shape new dataframe: {}'.format(str(transf_df.shape)))
print(transf_df)
print('Cols in new df according to FeatureImportance: ({}) {}'.format(len(new_cols), new_cols))
returns:
Shape old dataframe: (6, 5)
a b c d e
0 123.0 10.0 NaN Michael NaN
1 145.0 NaN NaN Jessica GE
2 100.0 30.0 NaN Sue US
3 NaN NaN NaN Jake GE
4 NaN NaN NaN Amy NaN
5 150.0 20.0 NaN Tye UK
Shape new dataframe: (6, 6)
[[123. 10. 0. 0. 0. 1. ]
[145. 20. 1. 0. 0. 0. ]
[100. 30. 0. 0. 1. 0. ]
[129.5 20. 1. 0. 0. 0. ]
[129.5 20. 0. 0. 0. 1. ]
[150. 20. 0. 1. 0. 0. ]]
Cols in new df according to FeatureImportance: (7) ['a', 'b', 'c', 'e_GE', 'e_UK', 'e_US', 'e_missing']
So you can see that column c was dropped from the resulting dataframe, but it still appears in the list of features.
So, my question is: is there a way to have a Pipeline with a feature-selection step inside a ColumnTransformer, or at least as a step before the ColumnTransformer in the outer Pipeline, that avoids the imputer's silent dropping of all-NaN columns?
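For what it's worth, the only stopgap I can think of is dropping all-NaN columns before the pipeline runs at all. Here is a minimal sketch; the drop_all_nan step and its lambda are my own illustration, not part of FeatureImportance, and it assumes the input is still a pandas DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical pre-step: drop columns that contain only NaN.
# This has to run while the data is still a DataFrame, i.e. before
# the ColumnTransformer converts it to a NumPy array.
drop_all_nan = FunctionTransformer(lambda X: X.dropna(axis=1, how='all'))

toy = pd.DataFrame({'a': [1.0, np.nan], 'c': [np.nan, np.nan]})
print(drop_all_nan.fit_transform(toy).columns.tolist())  # ['a']
```

The downside is that the column lists handed to the ColumnTransformer would then have to exclude the dropped columns, which is exactly the bookkeeping I was hoping to avoid.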
Thanks!
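For reference, the silent drop by SimpleImputer that motivates all of this is easy to reproduce in isolation (minimal sketch):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The second column is entirely NaN. SimpleImputer emits only a
# warning and returns one column instead of two.
X = np.array([[1.0, np.nan],
              [3.0, np.nan]])
out = SimpleImputer(strategy='mean').fit_transform(X)
print(out.shape)  # (2, 1)
```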
(Not sure if this is the correct place to make comments on your code; if not, please accept my apologies!)
Thank you for this great code!
I'm trying to use it, and I ran across an issue with a column transformer that uses the 'drop' option. For example:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import feature_importance as fi
data = {
    'a': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
    'b': [np.nan, 'F', np.nan, 'F', np.nan, 'M'],
    'c': [123, 145, 100, np.nan, np.nan, 150],
    'd': [10, np.nan, 30, np.nan, np.nan, 20],
    'e': [14, np.nan, 29, np.nan, 52, 45],
}
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'])
numerical_features = ['c', 'd', 'e']
categorical_features = ['a', 'b']
# Deal with categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
# Put together the column transformer
column_transformer = ColumnTransformer(
    transformers=[
        ('num', 'drop', numerical_features),  # drop numerical columns
        ('cat', categorical_transformer, categorical_features),
    ])
# Make a complete preprocessing pipeline
preproc = Pipeline(steps=[
    ('column_transformer', column_transformer),
    ('variance_threshold', VarianceThreshold(threshold=0.0)),
])
# Fit the preprocessor
fitted_pp = preproc.fit(df)
# Transform the dataset
transf_df = fitted_pp.transform(df)
print('Shape new dataframe: {}'.format(str(transf_df.shape)))
# Use class FeatureImportance to get the names of the new features
feature_importance = fi.FeatureImportance(fitted_pp)
new_cols = feature_importance.get_feature_names()
print('Length new cols: {}'.format(len(new_cols)))
print('New cols: {}'.format(new_cols))
Currently, the above code will output the following:
Shape new dataframe: (6, 9)
Length new cols: 12
New cols: ['c', 'd', 'e', 'a_Amy', 'a_Jake', 'a_Jessica', 'a_Michael', 'a_Sue', 'a_Tye', 'b_F', 'b_M', 'b_missing']
But cols c, d and e (numerical_features) should have been dropped.
If I alter line 91 (https://github.com/kylegilde/Kaggle-Notebooks/blob/master/Extracting-and-Plotting-Scikit-Feature-Names-and-Importances/feature_importance.py#L91) from:
if transformer_name == 'remainder' and transformer == 'drop':
to:
if transformer == 'drop':
then I get the expected result:
Shape new dataframe: (6, 9)
Length new cols: 9
New cols: ['a_Amy', 'a_Jake', 'a_Jessica', 'a_Michael', 'a_Sue', 'a_Tye', 'b_F', 'b_M', 'b_missing']
I'm not sure how this change affects other functionality of the class, though.
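As a sanity check on why the one-line change works (a minimal sketch, independent of the rest of the class): a fitted ColumnTransformer keeps the literal string 'drop' in its transformers_ attribute under the user-chosen name, so 'drop' entries are not limited to the 'remainder' slot:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'a': ['x', 'y'], 'c': [1.0, 2.0]})
ct = ColumnTransformer(transformers=[
    ('num', 'drop', ['c']),           # user-named 'drop' entry
    ('cat', OneHotEncoder(), ['a']),
])
ct.fit(toy)

# After fitting, the 'num' entry still carries the string 'drop',
# so testing `transformer == 'drop'` for every entry catches it.
for name, transformer, columns in ct.transformers_:
    print(name, transformer, columns)
```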