
InfoImputer

In the field of data science, one of the most common challenges is preparing datasets for machine learning algorithms. Dealing with missing values in the dataset is a critical aspect of this process. To address this challenge, data scientists have developed various imputation techniques that aim to accurately fill these missing values.

Among the popular imputers are:

SimpleImputer: This imputer fills missing values in the data using statistical properties such as mean, median, or the most frequent value.

KNNImputer: The KNNImputer completes missing values by utilizing the k-nearest neighbors algorithm.

IterativeImputer: This imputer estimates each feature from all the other features in an iterative manner.
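
For context, here is a minimal sketch of how these three scikit-learn imputers are typically used; the small array X is a made-up example:

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# A tiny numeric matrix with missing entries
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Fill each column with its mean
X_simple = SimpleImputer(strategy="mean").fit_transform(X)

# Fill from the k nearest complete rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Model each feature from the others, iterating until convergence
X_iter = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)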

Introducing InfoImputer:

It seems to perform better than the aforementioned imputers. It is similar in nature to the IterativeImputer but comes with some notable differences:

Handling uncorrelated features: The IterativeImputer exposes a hyperparameter called n_nearest_features, which determines how many other features are used to estimate the missing values of each feature column. However, using all other columns to estimate the target feature can lead to weak predictions and slower processing, especially when the features are uncorrelated. In contrast, InfoImputer offers two approaches: one sets an absolute correlation coefficient threshold to select only the most relevant features for estimation, and the other selects the n most informative features for the specific feature. Both ensure a more effective and efficient imputation process.
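
The two selection strategies can be sketched roughly as follows. This is only an illustration of the idea, not InfoImputer's actual internals; the DataFrame df and the target column name are hypothetical:

import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def select_by_correlation(df, target, threshold):
    # Keep columns whose absolute Pearson correlation with the target exceeds the threshold
    corr = df.corr(numeric_only=True)[target].abs().drop(target)
    return corr[corr > threshold].index.tolist()

def select_by_mutual_info(df, target, n):
    # Keep the n numeric columns sharing the most mutual information with the target
    complete = df.dropna()
    X = complete.drop(columns=[target]).select_dtypes("number")
    mi = mutual_info_regression(X, complete[target], random_state=0)
    return pd.Series(mi, index=X.columns).nlargest(n).index.tolist()

Restricting each per-feature model to strongly related columns is what keeps the imputation both fast and informative.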

Separate estimators for classification and regression: The IterativeImputer uses a single estimator for both categorical and numerical columns. However, InfoImputer recognizes the different nature of classification and regression tasks and employs separate estimators for each type. This tailored approach leads to more accurate imputed values.
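
The dispatch on column type can be sketched like this (again an illustration rather than the package's actual implementation; the column names are made up):

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingRegressor

def pick_estimator(series):
    # Classifier for categorical/object columns, regressor for numeric ones
    if series.dtype == object or str(series.dtype) == "category":
        return ExtraTreesClassifier(n_estimators=100, random_state=0)
    return GradientBoostingRegressor(random_state=0)

df = pd.DataFrame({"city": ["A", "B", None], "price": [1.0, None, 3.0]})
print(type(pick_estimator(df["city"])).__name__)   # ExtraTreesClassifier
print(type(pick_estimator(df["price"])).__name__)  # GradientBoostingRegressor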

Automated conversion of categorical values: In the IterativeImputer, converting categorical values to numeric format needs to be done manually. InfoImputer automates this process by factorizing categorical values into numeric representations. This simplifies the imputation workflow, particularly when dealing with categorical data.
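
The conversion itself corresponds to pandas' factorize, roughly as below; this sketches the general technique rather than the exact encoding InfoImputer applies:

import pandas as pd

colors = pd.Series(["red", "blue", "red", None])

# factorize maps each distinct category to an integer code; missing values become -1
codes, uniques = pd.factorize(colors)
print(codes)    # [ 0  1  0 -1]
print(uniques)  # Index(['red', 'blue'], dtype='object')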

By addressing these issues, InfoImputer offers an improved approach to handling missing values in datasets. It takes into account the correlation and mutual information score between features, utilizes separate estimators for classification and regression tasks, and automates the conversion of categorical values to numeric representations.

The Main Motivation

As I showed in the following notebook, correlation can only capture linear dependency between random variables, whereas mutual information can also detect nonlinear relations and dependencies.

https://www.kaggle.com/code/khashayarrahimi94/why-you-should-not-use-correlation

Therefore, besides the automation and ease of use this imputer provides, I add the mutual information score as a criterion for selecting dependent, informative features for each feature whose missing values we want to fill.
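
A quick numerical illustration of that point, using a quadratic relationship as an example:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 5000)
y = x ** 2  # deterministic but nonlinear dependence

# Pearson correlation is close to zero for this symmetric relationship...
print(round(np.corrcoef(x, y)[0, 1], 3))

# ...while the mutual information estimate is clearly positive
print(round(mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0], 3))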

Install

pip install Info-Imputer

Example

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingRegressor
from InfoImputer.Auto import Imputer

# Load the data as a pandas DataFrame
data = pd.read_csv(r"your directory")

# Name of the target column in your dataset
TargetName = "your target column"

# If you want to use the correlation coefficient threshold (here threshold = 0.1):
FilledData = Imputer(data, TargetName, 0.1, GradientBoostingRegressor, ExtraTreesClassifier)

# If you want to use the N most informative features using mutual information (here N = 3):
FilledData = Imputer(data, TargetName, 3, GradientBoostingRegressor, ExtraTreesClassifier)
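
Judging from the two calls above, the third argument appears to be interpreted as a correlation coefficient threshold when it is a float below 1 and as the number of most-informative features when it is an integer.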
