Lazy-cleaner:

What it is about :

Lazy-cleaner is a python package that cleans your dataframe and that includes correcting data types and class labeling techniques and filling the Null values with more than one method saving the time of coding exeplicit methods.

Advanteges

saving time.
preprocessing data like a pro.
user friendly for begineer Data learners.

Installation Guide

To install Lazy-cleaner Package:

Make sure python3 and pip is installed in your machine.
Use the command line and type: pip install Lazy-cleaner or pip3 install Lazy-cleaner

How to Use?

The most simple format of using Lazy-cleaner is as follow:

from Lazy_cleaner.Lazy_cleaner import stat
from Lazy_cleaner.Lazy_cleaner import fillnan
from Lazy_cleaner.Lazy_cleaner import label
from Lazy_cleaner.Lazy_cleaner import Clean_Data

import pandas as pd

Data = pd.read_csv("FileName.csv")
fill = fillnan()

New_Data = fill.simpleimputer(Data)

Existing functions:

Lazy_cleaner.fill_lin_rand( messy_df, metric , colnames ):

fill the Nan values by Linear Regression or by Random Forest.

also consider filling by median manually before using this method

messy_df -> data frame you want to work on.
metric -> model you want to work with (Linear Reg , Random forest).
colnames-> list of column names that contains Null vlaues.

Returns : the cleaned dataframe after filling the Null values and a list of the missing values

Lazy_cleaner.stat().p_val_of_features( df , label_column ):

calculate Ordinary Least Squares to know which features are important and make overview of the regression model.

df -> the dataframe.
label_column -> the Target column Name in string format.

Returns : summary report about the features and their p value

Lazy_cleaner.stat().na_per_data( df ):

calculate the percentage of Nan values in the whole dataset

df -> dataframe

returns: the percentage of Nan values in the whole dataset

Lazy_cleaner.fillnan().simpleimputer( df ):

this function fill the Null values with simple techniques like (mean,mode) depending on the data type of the column

also consider filling by median manually before using this method

df -> dataframe

returns: the dataframe after editing

Lazy_cleaner.fillnan().hyperimputer( df , metric = "Linear Regression" ):

this function compines between the simple imputation , the linear regression imputation and random forests imputation as in object columns it will impute with the mode and for any numerical number it will impute with linear regression or random forest

also consider filling by median manually before using this method

df->dataframe
metric ->"Linear Regression" or "Random Forests" models to fill you numeric nan values with

returns: the dataframe after editing

Lazy_cleaner.fillnan().fill_by_nex(df,columns=[]):

fill nan values by the next value

df -> dataframe
columns -> a list of columns you want to fill

returns: the dataframe after editing

Lazy_cleaner.fillnan().fill_by_perv(df,columns=[]):

this one fill nan values by the previous value

df -> dataframe
columns -> a list of columns you want to fill

returns: the dataframe after editing

Lazy_cleaner.label().to_category( df ):

change from and object datatype column into categorie datatype column

df-> dataframe

returns: the dataframe after editing

Lazy_cleaner.label().freq_labeling( df , column ):

replace objects by how frequent they are in a certain column

df -> dataframe
column -> column you want to apply this method on

returns: the dataframe after editing

Lazy_cleaner.Clean_Data().reduce_mem_usage( props ):

reduce memory usage of the dataframe.

props-> dataset you want to reduce

returns: the dataframe after editing and Nan values list

Lazy_cleaner.Clean_Data().replace_matches( df , column , string_to_match , min_ratio ):

if there is simillar words but different format due to a data entry you can apply this function to calculate similarity percentage and then change them

df -> dataframe
column -> column to edit
string_to_match -> string you want to match and replace
min_ratio -> minimum probability to excange

returns: the dataframe after editing

Lazy_cleaner.Clean_Data().drop_missing( df , thresh=55 ):

drop the columns if the missing values exceed 60% of it

df-> dataframe
thresh->percentage of the missing threshold to Delete above

returns: the dataframe after editing

Lazy_cleaner.Clean_Data().log_features( df=None ):

apply Log function to the features.

df -> dataframe you want to apply log func to

returns: dataset after applying log

Lazy_cleaner.Clean_Data().dealing_with_outliers( df , type_o = "z-score"):

removes outliers with two techniques z-score and IQR ("Inter-Quartile Range").

df -> dataframe
type_o -> type of the method you want to choose

Returns : the dataframe after removing the outliers

Lazy_cleaner.Clean_Data().normalize( df , columns ):

normalize columns by your choice in a dataframe

df -> dataframe
columns -> list of columns to normalize its values

returns: the dataframe after editing

Lazy_cleaner.Clean_Data().en_de( decode , typeo ):

encode and decode any string format

decode -> if you want to encode assign as False if decode assign True
typeo -> the format you want to encode to or decode from

returns: the result after encoding or decoding

ahmed-ashraf-khalil / lazy-cleaner Goto Github PK

lazy-cleaner's Introduction

Lazy-cleaner:

What it is about :

Advanteges

Installation Guide

How to Use?

Existing functions:

Lazy_cleaner.fill_lin_rand( messy_df, metric , colnames ):

Lazy_cleaner.stat().p_val_of_features( df , label_column ):

Lazy_cleaner.stat().na_per_data( df ):

Lazy_cleaner.fillnan().simpleimputer( df ):

Lazy_cleaner.fillnan().hyperimputer( df , metric = "Linear Regression" ):

Lazy_cleaner.fillnan().fill_by_nex(df,columns=[]):

Lazy_cleaner.fillnan().fill_by_perv(df,columns=[]):

Lazy_cleaner.label().to_category( df ):

Lazy_cleaner.label().freq_labeling( df , column ):

Lazy_cleaner.Clean_Data().reduce_mem_usage( props ):

Lazy_cleaner.Clean_Data().replace_matches( df , column , string_to_match , min_ratio ):

Lazy_cleaner.Clean_Data().drop_missing( df , thresh=55 ):

Lazy_cleaner.Clean_Data().log_features( df=None ):

Lazy_cleaner.Clean_Data().dealing_with_outliers( df , type_o = "z-score"):

Lazy_cleaner.Clean_Data().normalize( df , columns ):

Lazy_cleaner.Clean_Data().en_de( decode , typeo ):

lazy-cleaner's People

Contributors

Stargazers

Watchers

Recommend Projects

Recommend Topics

Recommend Org