imbalance's Introduction

imbalance

imbalance provides a set of tools to work with imbalanced datasets: novel oversampling algorithms, filtering of instances and evaluation of synthetic instances.

Installation

You can install imbalance from GitHub with:

# install.packages("devtools")
devtools::install_github("ncordon/imbalance")

Examples

Run the pdfos algorithm on the imbalanced newthyroid1 dataset and plot a comparison of the attributes before and after oversampling.

library("imbalance")
data(newthyroid1)

newSamples <- pdfos(newthyroid1, numInstances = 80)
# Join new samples with old imbalanced dataset
newDataset <- rbind(newthyroid1, newSamples)
# Plot a visual comparison between both datasets
plotComparison(newthyroid1, newDataset, attrs = names(newthyroid1)[1:3], cols = 2, classAttr = "Class")

After filtering examples with neater:

filteredSamples <- neater(newthyroid1, newSamples, iterations = 500)
#> [1] "12 samples filtered by NEATER"
filteredNewDataset <- rbind(newthyroid1, filteredSamples)
plotComparison(newthyroid1, filteredNewDataset, attrs = names(newthyroid1)[1:3])

Run the ADASYN method through the oversample wrapper provided by the package, and compare the imbalance ratio of the dataset before and after oversampling:

imbalanceRatio(glass0)
#> [1] 0.4861111
newDataset <- oversample(glass0, method = "ADASYN")
imbalanceRatio(newDataset)
#> [1] 0.9722222
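
To make the numbers above easier to read, here is a minimal base R sketch of an imbalance ratio, assuming it is defined as the size of the minority class divided by the size of the majority class. The helper `ratio_sketch` and the toy data frame are hypothetical and are not part of the package; they only illustrate the arithmetic.

```r
# Toy dataset: 2 minority ("positive") and 8 majority ("negative") instances
toy <- data.frame(
  x = seq_len(10),
  Class = factor(c(rep("positive", 2), rep("negative", 8)))
)

# Hypothetical helper (not the package's imbalanceRatio):
# assumes ratio = minority class size / majority class size
ratio_sketch <- function(dataset, classAttr = "Class") {
  counts <- table(dataset[[classAttr]])
  as.numeric(min(counts) / max(counts))
}

toy_ratio <- ratio_sketch(toy)
print(toy_ratio)
```

Under this definition a ratio of 1 means perfectly balanced classes, which matches the direction of the change shown above (the ratio moves closer to 1 after oversampling).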

imbalance's People

Contributors

ncordon

imbalance's Issues

rwo() does not support non-numeric factors properly

The documentation of the rwo function claims it can handle every type of dataset. Indeed, the second part of the method adds noise to numeric features and samples from the existing values for all other feature types. However, it can fail earlier, because it first attempts to convert all features/columns to numeric. IMO this conversion is not necessary in this algorithm and was probably just copy-pasted, as it also exists at the beginning of the other oversampling methods (where it makes sense). For example, if a factor has non-numeric levels, rwo throws an error when executing dataset <- toNumeric(dataset, exclude = classAttr).

# Works: all features are numeric
imbalance::rwo(data.frame(test1 = rnorm(10), test2 = rnorm(10), class = factor(sample(c("a", "b"), 10, TRUE))), numInstances = 5, classAttr = "class")
# Error: test2 is a factor with non-numeric levels
imbalance::rwo(data.frame(test1 = rnorm(10), test2 = factor(sample(c("a", "b"), 10, TRUE)), class = factor(sample(c("a", "b"), 10, TRUE))), numInstances = 5, classAttr = "class")
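
Until this is fixed, it may help to check a dataset for such columns before calling rwo. The following is a minimal base R sketch; `has_non_numeric_levels` is a hypothetical helper, and it does not reproduce what toNumeric does internally. It only flags factor columns whose levels cannot be parsed as numbers.

```r
# Hypothetical helper: TRUE for factor columns whose levels are not numbers
has_non_numeric_levels <- function(column) {
  is.factor(column) &&
    any(is.na(suppressWarnings(as.numeric(levels(column)))))
}

df <- data.frame(
  test1 = rnorm(10),
  test2 = factor(sample(c("a", "b"), 10, TRUE)),
  class = factor(sample(c("a", "b"), 10, TRUE))
)

# Check every feature except the class attribute
features <- setdiff(names(df), "class")
flags <- vapply(df[features], has_non_numeric_levels, logical(1))
problematic <- features[flags]
print(problematic)
```

Here only test2 is flagged, which is the column that triggers the error in the second call above.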

oversample() gives an error if ratio >= 0.98 for SMOTE

The sample code below illustrates the issue. The documentation says that ratio can be between 0 and 1. However, oversample() fails when the specified ratio is very close to 1 (0.98 and above in the runs below), including ratio = 1. How can a user work around this?

newDataset <- oversample(glass0, ratio = 0.9, method = "SMOTE")  # No problem
newDataset <- oversample(glass0, ratio = 1, method = "SMOTE")
#> Error in sample.int(length(x), size, replace, prob) :
#>   cannot take a sample larger than the population when 'replace = FALSE'
newDataset <- oversample(glass0, ratio = 1.0, method = "SMOTE")
#> Error in sample.int(length(x), size, replace, prob) :
#>   cannot take a sample larger than the population when 'replace = FALSE'
newDataset <- oversample(glass0, ratio = 0.95, method = "SMOTE")
newDataset <- oversample(glass0, ratio = 0.99, method = "SMOTE")
#> Error in sample.int(length(x), size, replace, prob) :
#>   cannot take a sample larger than the population when 'replace = FALSE'
newDataset <- oversample(glass0, ratio = 0.98, method = "SMOTE")
#> Error in sample.int(length(x), size, replace, prob) :
#>   cannot take a sample larger than the population when 'replace = FALSE'
newDataset <- oversample(glass0, ratio = 0.97, method = "SMOTE")  # No error for ratio = 0.97
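
One possible reading of the 0.97/0.98 threshold, not confirmed against the package source, is that the number of synthetic instances needed exceeds the number of minority instances, so a sample without replacement from the minority class fails. The sketch below only checks that arithmetic. The class counts 70 and 144 are an assumption about glass0 (they reproduce the imbalance ratio 0.4861111 shown in the README), and `needed_instances` is a hypothetical helper.

```r
# Assumed class counts for glass0 (not stated in the issue);
# 70 / 144 reproduces the imbalance ratio 0.4861111.
n_minority <- 70
n_majority <- 144

# Hypothetical helper: synthetic instances required to reach a target ratio
needed_instances <- function(ratio) ratio * n_majority - n_minority

ratios <- c(0.9, 0.95, 0.97, 0.98, 0.99, 1)
needed <- needed_instances(ratios)
exceeds <- needed > n_minority
print(data.frame(ratio = ratios, needed = needed, exceeds_minority = exceeds))

# Base R raises the same kind of error when asked for more draws than items
msg <- tryCatch(
  sample(seq_len(n_minority), 72),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Under these assumptions the ratios that need more than 70 new instances are exactly the ones that fail in the report. If that reading is right, a workaround worth trying is a ratio of at most 0.97 for this dataset, or a different method.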

Improve bandwidth parameter search in PDFOS

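Currently a naive grid search is used, relying on the fact that the best bandwidth is expected to be of the order of Silverman's rule of thumb: the candidates are multiples of the Silverman bandwidth, with multipliers from 0.25 up to (but not including) 1.5 in steps of 0.05.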

  for(double v = 0.25; v < 1.5; v = v + 0.05){
    possible_bwidth.push_back(v * silverman_bandwidth);
  }

Find a better way to adapt the parameter. Maybe simulated annealing (?).
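
As a rough illustration of the idea, the R sketch below compares the grid of multipliers with a continuous one-dimensional search over the same interval. It is a univariate toy under stated assumptions, not the PDFOS implementation: the score is a leave-one-out log-likelihood of a Gaussian kernel density estimate (PDFOS may use a different criterion), the data are simulated, and `loo_loglik` is a hypothetical helper. Silverman's rule of thumb is taken from base R's `bw.nrd0`.

```r
set.seed(1)
x <- rnorm(50)  # simulated data, for illustration only

# Silverman's rule of thumb as implemented in stats::bw.nrd0
h_silverman <- bw.nrd0(x)

# Hypothetical score: leave-one-out log-likelihood of a Gaussian KDE
loo_loglik <- function(h, x) {
  n <- length(x)
  k <- dnorm(outer(x, x, "-") / h) / h  # pairwise kernel values
  diag(k) <- 0                           # exclude each point from its own estimate
  sum(log(rowSums(k) / (n - 1)))
}

# Grid of multipliers similar to the loop above: 0.25, 0.30, ..., 1.45
grid <- seq(0.25, 1.45, by = 0.05)
grid_scores <- vapply(grid, function(v) loo_loglik(v * h_silverman, x), numeric(1))
best_grid <- grid[which.max(grid_scores)]

# Continuous search for the multiplier over the same interval
opt <- optimize(function(v) loo_loglik(v * h_silverman, x),
                interval = c(0.25, 1.5), maximum = TRUE)

print(c(grid = best_grid, continuous = opt$maximum))
```

`optimize` only finds a local optimum, so for a multimodal score a global method such as the simulated annealing mentioned above (for example `optim` with `method = "SANN"`) could be tried on the same objective.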
