kylegilde / kaggle-notebooks
My notebooks created on Kaggle
I came across another problem, maybe related to #1, but I'm not sure whether this is by design or not.
In a slightly different example from the one before, I created a toy dataframe with one column ("c") that has only null values. I want this column to be dropped inside the ColumnTransformer pipeline before imputing, because an all-NaN column will be silently dropped by SimpleImputer, and in my opinion it is better to have a step that does this explicitly. So the code below:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import feature_importance as fi
data = {
    'a': [123, 145, 100, np.nan, np.nan, 150],
    'b': [10, np.nan, 30, np.nan, np.nan, 20],
    'c': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    'd': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
    'e': [np.nan, 'GE', 'US', 'GE', np.nan, 'UK'],
}
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'])
numerical_features = ['a', 'b', 'c']
categorical_features = ['e']
drop_features = ['d']
# Deal with numerical columns
numerical_transformer = Pipeline(steps=[
    ('remnulls', VarianceThreshold(threshold=0.0)),  # col c has only NaNs and should be dropped
    ('imputer', SimpleImputer(strategy='mean')),
])
# Deal with categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
# Put together the column transformer
column_transformer = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),  # remove all-NaN cols before imputing
        ('cat', categorical_transformer, categorical_features),  # impute + one-hot encoding
        ('dropme', 'drop', drop_features),  # drop column d
    ])
# Make a complete preprocessing pipeline
preproc = Pipeline(steps=[
    ('column_transformer', column_transformer),
])
# Fit the preprocessor
fitted_pp = preproc.fit(df)
# Transform the dataset
transf_df = fitted_pp.transform(df)
print('Shape old dataframe: {}'.format(str(df.shape)))
print(df)
# Use class FeatureImportance to get the names of the new features
feature_importance = fi.FeatureImportance(fitted_pp)
new_cols = feature_importance.get_selected_features()
print('Shape new dataframe: {}'.format(str(transf_df.shape)))
print(transf_df)
print('Cols in new df according to FeatureImportance: ({}) {}'.format(len(new_cols), new_cols))
returns:
Shape old dataframe: (6, 5)
a b c d e
0 123.0 10.0 NaN Michael NaN
1 145.0 NaN NaN Jessica GE
2 100.0 30.0 NaN Sue US
3 NaN NaN NaN Jake GE
4 NaN NaN NaN Amy NaN
5 150.0 20.0 NaN Tye UK
Shape new dataframe: (6, 6)
[[123. 10. 0. 0. 0. 1. ]
[145. 20. 1. 0. 0. 0. ]
[100. 30. 0. 0. 1. 0. ]
[129.5 20. 1. 0. 0. 0. ]
[129.5 20. 0. 0. 0. 1. ]
[150. 20. 0. 1. 0. 0. ]]
Cols in new df according to FeatureImportance: (7) ['a', 'b', 'c', 'e_GE', 'e_UK', 'e_US', 'e_missing']
So you can see that column c was dropped from the resulting dataframe, but it still appears in the list of features.
So, my question is: is there a way to have a Pipeline with a feature-selection step inside a ColumnTransformer, or at least as a step before the ColumnTransformer in the outer Pipeline, that avoids the imputer's silent dropping of all-NaN columns?
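For what it's worth, the only stopgap I can think of is dropping all-NaN columns before the pipeline runs at all. Here is a minimal sketch; the drop_all_nan step and its lambda are my own illustration, not part of FeatureImportance, and it assumes the input is still a pandas DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical pre-step: drop columns that contain only NaN.
# This has to run while the data is still a DataFrame, i.e. before
# the ColumnTransformer converts it to a NumPy array.
drop_all_nan = FunctionTransformer(lambda X: X.dropna(axis=1, how='all'))

toy = pd.DataFrame({'a': [1.0, np.nan], 'c': [np.nan, np.nan]})
print(drop_all_nan.fit_transform(toy).columns.tolist())  # ['a']
```

The downside is that the column lists handed to the ColumnTransformer would then have to exclude the dropped columns, which is exactly the bookkeeping I was hoping to avoid.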
Thanks!
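For reference, the silent drop by SimpleImputer that motivates all of this is easy to reproduce in isolation (minimal sketch):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The second column is entirely NaN. SimpleImputer emits only a
# warning and returns one column instead of two.
X = np.array([[1.0, np.nan],
              [3.0, np.nan]])
out = SimpleImputer(strategy='mean').fit_transform(X)
print(out.shape)  # (2, 1)
```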
(Not sure if this is the correct place to make comments on your code; if not, please accept my apologies!)
Thank you for this great code!
I'm trying to use it, and I ran across an issue with a column transformer that uses the 'drop' option. For example:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import feature_importance as fi
data = {
    'a': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
    'b': [np.nan, 'F', np.nan, 'F', np.nan, 'M'],
    'c': [123, 145, 100, np.nan, np.nan, 150],
    'd': [10, np.nan, 30, np.nan, np.nan, 20],
    'e': [14, np.nan, 29, np.nan, 52, 45],
}
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd', 'e'])
numerical_features = ['c', 'd', 'e']
categorical_features = ['a', 'b']
# Deal with categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
# Put together the column transformer
column_transformer = ColumnTransformer(
    transformers=[
        ('num', 'drop', numerical_features),  # drop numerical columns
        ('cat', categorical_transformer, categorical_features),
    ])
# Make a complete preprocessing pipeline
preproc = Pipeline(steps=[
    ('column_transformer', column_transformer),
    ('variance_threshold', VarianceThreshold(threshold=0.0)),
])
# Fit the preprocessor
fitted_pp = preproc.fit(df)
# Transform the dataset
transf_df = fitted_pp.transform(df)
print('Shape new dataframe: {}'.format(str(transf_df.shape)))
# Use class FeatureImportance to get the names of the new features
feature_importance = fi.FeatureImportance(fitted_pp)
new_cols = feature_importance.get_feature_names()
print('Length new cols: {}'.format(len(new_cols)))
print('New cols: {}'.format(new_cols))
Currently, the above code will output the following:
Shape new dataframe: (6, 9)
Length new cols: 12
New cols: ['c', 'd', 'e', 'a_Amy', 'a_Jake', 'a_Jessica', 'a_Michael', 'a_Sue', 'a_Tye', 'b_F', 'b_M', 'b_missing']
But cols c, d and e (numerical_features) should have been dropped.
If I alter line 91 (https://github.com/kylegilde/Kaggle-Notebooks/blob/master/Extracting-and-Plotting-Scikit-Feature-Names-and-Importances/feature_importance.py#L91) from:
if transformer_name == 'remainder' and transformer == 'drop':
to:
if transformer == 'drop':
then I get the expected result:
Shape new dataframe: (6, 9)
Length new cols: 9
New cols: ['a_Amy', 'a_Jake', 'a_Jessica', 'a_Michael', 'a_Sue', 'a_Tye', 'b_F', 'b_M', 'b_missing']
I'm not sure how this change affects other functionality of the class, though.
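As a sanity check on why the one-line change works (a minimal sketch, independent of the rest of the class): a fitted ColumnTransformer keeps the literal string 'drop' in its transformers_ attribute under the user-chosen name, so 'drop' entries are not limited to the 'remainder' slot:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'a': ['x', 'y'], 'c': [1.0, 2.0]})
ct = ColumnTransformer(transformers=[
    ('num', 'drop', ['c']),           # user-named 'drop' entry
    ('cat', OneHotEncoder(), ['a']),
])
ct.fit(toy)

# After fitting, the 'num' entry still carries the string 'drop',
# so testing `transformer == 'drop'` for every entry catches it.
for name, transformer, columns in ct.transformers_:
    print(name, transformer, columns)
```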