
class-imbalance's Introduction

Imbalance Data Problem

In machine learning, we often encounter imbalanced data. For example, in a bank's credit data, 97% of customers pay their loans on time, while only 3% do not. A model that ignores the 3% who fail to pay on time can still reach high accuracy, yet it may cause the bank huge losses. Therefore, we need appropriate methods to balance the data.
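To make the accuracy problem concrete, here is a minimal sketch using made-up labels that match the 97% / 3% split above. The "model" is a hypothetical one that always predicts the majority class; it is for illustration only and is not part of this repository.

```python
import numpy as np

# Hypothetical labels matching the 97% / 3% split described above:
# 0 = pays on time, 1 = does not pay on time.
y_true = np.array([0] * 970 + [1] * 30)

# A trivial "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Overall accuracy looks good because the majority class dominates.
accuracy = np.mean(y_pred == y_true)

# Recall on the minority class: fraction of non-payers that were flagged.
minority_recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(f"accuracy: {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

The accuracy is 0.97 while the minority recall is 0.00, so accuracy alone hides the fact that every non-paying customer is missed.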

The research literature offers many techniques for dealing with data imbalance, including over-sampling and under-sampling. This repository implements some of them.
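As a baseline for comparison, the two basic ideas can be sketched with plain random resampling. This is only an illustration on made-up toy data; it is not one of the techniques implemented in this repository, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy data: 8 majority samples (label -1), 2 minority samples (label 1).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([-1] * 8 + [1] * 2)

maj_idx = np.flatnonzero(y == -1)
min_idx = np.flatnonzero(y == 1)

# Random under-sampling: keep only as many majority samples as minority samples.
keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
under_idx = np.concatenate([keep, min_idx])
X_under, y_under = X[under_idx], y[under_idx]

# Random over-sampling: duplicate minority samples (drawn with replacement)
# until both classes have the same size.
extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
over_idx = np.concatenate([maj_idx, min_idx, extra])
X_over, y_over = X[over_idx], y[over_idx]

print("under-sampled class sizes:", (y_under == -1).sum(), (y_under == 1).sum())
print("over-sampled class sizes:", (y_over == -1).sum(), (y_over == 1).sum())
```

Under-sampling discards majority data and plain over-sampling only repeats existing minority points. Synthetic methods such as SMOTE and ADASYN create new minority points instead.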

Requirements

scikit-learn (imported as sklearn)
numpy

SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) was introduced in the paper "SMOTE: Synthetic Minority Over-sampling Technique" by N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer.

Parameters
----------
sample      2D (numpy)array
            minority class samples
            
N           Integer
            amount of SMOTE as a percentage, N%
            (e.g. N=300 creates 3 synthetic samples per minority sample)
            
k           Integer
            number of nearest neighbors k
            k <= number of minority class samples
            
Attributes
----------
newIndex    Integer
            keep a count of number of synthetic samples
            initialize as 0
            
synthetic   2D array
            array for synthetic samples
            
neighbors   K-Nearest Neighbors model

The corresponding code is in smote.py.
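For intuition about what sample, N and k control, the following is a minimal sketch of the SMOTE interpolation step. It is not the code in smote.py: the function name smote_sketch and the seed argument are assumptions made for illustration. It also assumes that N is a multiple of 100, that there are no duplicate rows, and that k is smaller than the number of minority samples.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(sample, N=100, k=5, seed=None):
    """Illustrative SMOTE: returns (N / 100) * len(sample) synthetic points."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    n_per_point = N // 100  # assumes N is a multiple of 100

    # Ask for k + 1 neighbors: with no duplicate rows, the nearest neighbor
    # of each point is the point itself, which is dropped below.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(sample)
    idx = nn.kneighbors(sample, return_distance=False)[:, 1:]

    synthetic = []
    for i, x in enumerate(sample):
        for _ in range(n_per_point):
            neighbor = sample[rng.choice(idx[i])]  # one of the k neighbors
            gap = rng.random()                     # uniform in [0, 1)
            # New point on the segment between x and the chosen neighbor.
            synthetic.append(x + gap * (neighbor - x))
    return np.array(synthetic)


X = np.array([[1, 0.7], [0.95, 0.76], [0.98, 0.85], [0.95, 0.78], [1.12, 0.81]])
synthetic = smote_sketch(X, N=300, k=3, seed=0)
print(synthetic.shape)  # (15, 2): 3 synthetic points for each of the 5 samples
```

Because each synthetic point is a convex combination of two minority samples, it always lies on the line segment between them.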

Example

from smote import Smote
import numpy as np

X = np.array([[1, 0.7], [0.95, 0.76], [0.98, 0.85], [0.95, 0.78], [1.12, 0.81]])
s = Smote(sample=X, N=300, k=3)
s.over_sampling()
print(s.synthetic)

A sample run prints output like the following (the synthetic values depend on random draws):

[[0.9688157377661356, 0.7470434369118096], [0.970373970826427, 0.7203406632716296], [0.955180350748186, 0.7209519703266685], [0.95, 0.76], [0.9603507618011522, 0.7093355880188698], [0.95, 0.76], [0.98, 0.85], [0.98, 0.85], [0.9767000397651937, 0.8023105914068543], [0.95, 0.78], [0.95, 0.78], [0.9536226582758756, 0.8380845147770741], [1.025027934535906, 0.8276733346832177], [1.0691988855686414, 0.8064896755773396], [1.0457470065562635, 0.7305641034293823]]

Visualize a Simple Example

Suppose the blue triangles are majority class data and the green triangles are minority class data. The red dots are the synthetic samples generated by SMOTE.

ADASYN

ADASYN (Adaptive Synthetic Sampling) was introduced in the paper "ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning" by Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li.

Parameters
-----------
X           2D array
            feature space X
            
Y           array
            class labels; each label is either -1 or 1
            
dth         float in (0,1]
            preset threshold
            maximum tolerated degree of class imbalance ratio
            
b           float in [0, 1]
            desired balance level after generation of the synthetic data
            
K           Integer
            number of nearest neighbors
            
Attributes
----------
ms          Integer
            the number of minority class examples
            
ml          Integer
            the number of majority class examples
            
d           float in (0, 1]
            degree of class imbalance, d = ms/ml
            
minority    Integer label
            the class label of the minority class
            
neighbors   K-Nearest Neighbors model

synthetic   2D array
            array for synthetic samples

The corresponding code is in adasyn.py.
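To show how dth, b and K interact, here is a minimal sketch of the ADASYN steps as described in the paper. It is not the code in adasyn.py, and its output will generally differ from that of the Adasyn class. The function name adasyn_sketch, the seed argument and the default values are assumptions. Two further choices are made for this sketch: per-sample counts are rounded to the nearest integer, and interpolation partners are taken from each point's nearest minority neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def adasyn_sketch(X, Y, dth=0.75, b=1.0, K=5, seed=None):
    """Illustrative ADASYN; assumes two classes and no duplicate rows."""
    rng = np.random.default_rng(seed)
    X, Y = np.asarray(X, dtype=float), np.asarray(Y)
    empty = np.empty((0, X.shape[1]))

    # Minority label and class sizes ms (minority) and ml (majority).
    labels, counts = np.unique(Y, return_counts=True)
    minority = labels[np.argmin(counts)]
    ms, ml = counts.min(), counts.max()
    X_min = X[Y == minority]

    d = ms / ml  # degree of class imbalance
    if d >= dth or len(X_min) < 2:
        return empty  # imbalance is tolerated, or nothing to interpolate

    G = (ml - ms) * b  # total number of synthetic samples to aim for

    # r[i]: share of majority samples among the K nearest neighbors of each
    # minority point, searched over the whole data set (self excluded).
    nn_all = NearestNeighbors(n_neighbors=K + 1).fit(X)
    idx_all = nn_all.kneighbors(X_min, return_distance=False)[:, 1:]
    r = np.mean(Y[idx_all] != minority, axis=1)
    if r.sum() == 0:
        return empty
    g = np.rint(r / r.sum() * G).astype(int)  # synthetic samples per point

    # Interpolation partners: nearest minority neighbors (self excluded).
    k_min = min(K, len(X_min) - 1)
    nn_min = NearestNeighbors(n_neighbors=k_min + 1).fit(X_min)
    idx_min = nn_min.kneighbors(X_min, return_distance=False)[:, 1:]

    synthetic = []
    for i, x in enumerate(X_min):
        for _ in range(g[i]):
            partner = X_min[rng.choice(idx_min[i])]
            synthetic.append(x + rng.random() * (partner - x))
    return np.array(synthetic).reshape(-1, X.shape[1])


X = [[1, 1], [1.3, 1.3], [0.7, 1.2], [1.1, 1.1], [0.95, 1.3], [1.4, 1.4],
     [1.2, 1.5], [1.2, 1.7], [1.2, 1.3], [0.88, 0.9], [0.98, 0.55],
     [0.92, 1.24], [0.8, 1.35], [1, 2], [1.5, 1.5], [0.8, 1.8]]
Y = [-1] * 13 + [1] * 3
out = adasyn_sketch(X, Y, dth=1, b=1, K=4, seed=0)
print(out.shape)  # (10, 2): G = (13 - 3) * 1 = 10
```

Minority points surrounded by more majority neighbors receive a larger share of the G synthetic samples, which is what distinguishes ADASYN from SMOTE's uniform allocation.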

Example

from adasyn import Adasyn
X = [[1, 1], [1.3, 1.3], [0.7, 1.2], [1.1, 1.1], [0.95, 1.3], [1.4, 1.4],
     [1.2, 1.5], [1.2, 1.7], [1.2, 1.3], [0.88, 0.9], [0.98, 0.55],
     [0.92, 1.24], [0.8, 1.35], [1, 2], [1.5, 1.5], [0.8, 1.8]]
Y = [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1]
a = Adasyn(X, Y, dth=1, b=1, K=4)
a.sampling()
print(a.synthetic)

A sample run prints output like the following (the synthetic values depend on random draws):

[[1.0, 2.0], [0.8733313075577545, 1.8733313075577545], [1.5, 1.5], [1.5, 1.5], [1.5, 1.5], [1.5, 1.5], [0.9621350795352689, 1.962135079535269], [0.8593405970230131, 1.8593405970230132]]

Visualize a Simple Example

Suppose the blue triangles are majority class data and the green triangles are minority class data. The red dots are the synthetic samples generated by ADASYN.
