multilabel-oversampling's Introduction

Multilabel Oversampling 🌻

Many algorithms for imbalanced data support only binary and multiclass classification. This package is designed for multi-label classification (also known as multi-target classification).

🎰 Algorithm

  • Start with a multilabel dataset (as a pandas.DataFrame) with imbalanced labels.
  • Calculate the count per class, then calculate the standard deviation (std) of those counts.
  • Repeat the following number_of_adds times:
    • Randomly draw a sample from your data and calculate the new std.
    • If the new std is lower, add the sample to your dataset.
    • If not, draw another sample (do this up to number_of_tries times).
  • A new df is returned.
  • A result plot visualizes the target distribution before and after upsampling, along with the counts per index.
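The steps above can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions, not the package's actual implementation: the function names `label_std` and `greedy_oversample`, the `target_cols` and `seed` parameters, the use of the population standard deviation, and the choice to stop once no draw helps are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def label_std(df, target_cols):
    """Standard deviation of the per-label positive counts (population std)."""
    return float(df[target_cols].sum().to_numpy().std())


def greedy_oversample(df, target_cols, number_of_adds=100, number_of_tries=100, seed=0):
    """Hypothetical sketch: append rows only when they lower the count std."""
    rng = np.random.default_rng(seed)
    result = df.copy()
    current_std = label_std(result, target_cols)
    for _ in range(number_of_adds):
        for _ in range(number_of_tries):
            # Draw one candidate row (with replacement) from the original data.
            candidate = df.iloc[[rng.integers(len(df))]]
            trial = pd.concat([result, candidate])
            trial_std = label_std(trial, target_cols)
            if trial_std < current_std:
                # The candidate balances the labels, so keep it.
                result, current_std = trial, trial_std
                break
        else:
            # No draw lowered the std within number_of_tries attempts: stop.
            break
    return result
```

Because rows are appended with their original index labels, duplicated rows remain traceable to their source row, which is what makes a "counts per index" view possible.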

โžก๏ธ Usage

import multilabel_oversampling as mo

mo.seed_everything(20)
df = mo.create_fake_data(size=1) # difficult fake dataset with very high dependency of y1 and y2
ml_oversampler = mo.MultilabelOversampler(number_of_adds=100, number_of_tries=100)
df_new = ml_oversampler.fit(df)
#>Start the upsampling process.
#>Iteration:  11%|████████████████                                        | 11/100 [00:00<00:01, 48.43it/s]
#>Iter 11: No improvement after 100 tries.
#>Sampling done.
#>
#>Dataset size original: 20; Upsampled dataset size: 31
#>Original target distribution:  {'y1': 16, 'y2': 12, 'y3': 4, 'y4': 4}
#>Upsampled target distribution: {'y1': 19, 'y2': 12, 'y3': 15, 'y4': 15}
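To see how much the balance improved in the run above, the standard deviation of the two printed distributions can be recomputed by hand. This sketch uses the population standard deviation from the standard library; whether the package uses the population or the sample variant internally is an assumption here, but the direction of the change is the same either way.

```python
from statistics import pstdev

# Label counts copied from the output printed above.
original = {"y1": 16, "y2": 12, "y3": 4, "y4": 4}
upsampled = {"y1": 19, "y2": 12, "y3": 15, "y4": 15}

std_before = pstdev(list(original.values()))
std_after = pstdev(list(upsampled.values()))

print(f"std before: {std_before:.2f}")  # std before: 5.20
print(f"std after:  {std_after:.2f}")   # std after:  2.49
```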

ml_oversampler.plot_all_tries()

Plot from ml_oversampler.plot_all_tries()

ml_oversampler.plot_results()

Plot from ml_oversampler.plot_results()

#import seaborn as sns
#df.style.background_gradient(cmap=sns.color_palette("Spectral", as_cmap=True))

# Original DataFrame
print(df)
#>    y1  y2  y3  y4           x
#>0    1   1   0   0   img_0.jpg
#>1    1   1   0   0   img_1.jpg
#>2    1   1   0   1   img_2.jpg
#>3    1   1   0   0   img_3.jpg
#>4    1   1   1   0   img_4.jpg
#>5    1   1   0   0   img_5.jpg
#>6    1   1   0   0   img_6.jpg
#>7    1   1   0   0   img_7.jpg
#>8    1   1   0   1   img_8.jpg
#>9    1   1   0   0   img_9.jpg
#>10   1   1   0   0  img_10.jpg
#>11   1   1   0   0  img_11.jpg
#>12   1   0   1   0  img_12.jpg
#>13   1   0   1   1  img_13.jpg
#>14   1   0   0   0  img_14.jpg
#>15   1   0   0   0  img_15.jpg
#>16   0   0   0   0  img_16.jpg
#>17   0   0   0   0  img_17.jpg
#>18   0   0   0   0  img_18.jpg
#>19   0   0   1   1  img_19.jpg


# New DataFrame after upsampling
print(df_new)
#>    y1  y2  y3  y4           x
#>0    1   1   0   0   img_0.jpg
#>1    1   1   0   0   img_1.jpg
#>2    1   1   0   1   img_2.jpg
#>3    1   1   0   0   img_3.jpg
#>4    1   1   1   0   img_4.jpg
#>5    1   1   0   0   img_5.jpg
#>6    1   1   0   0   img_6.jpg
#>7    1   1   0   0   img_7.jpg
#>8    1   1   0   1   img_8.jpg
#>9    1   1   0   0   img_9.jpg
#>10   1   1   0   0  img_10.jpg
#>11   1   1   0   0  img_11.jpg
#>12   1   0   1   0  img_12.jpg
#>13   1   0   1   1  img_13.jpg
#>14   1   0   0   0  img_14.jpg
#>15   1   0   0   0  img_15.jpg
#>16   0   0   0   0  img_16.jpg
#>17   0   0   0   0  img_17.jpg
#>18   0   0   0   0  img_18.jpg
#>19   0   0   1   1  img_19.jpg
#>19   0   0   1   1  img_19.jpg
#>19   0   0   1   1  img_19.jpg
#>13   1   0   1   1  img_13.jpg
#>13   1   0   1   1  img_13.jpg
#>13   1   0   1   1  img_13.jpg
#>19   0   0   1   1  img_19.jpg
#>19   0   0   1   1  img_19.jpg
#>19   0   0   1   1  img_19.jpg
#>19   0   0   1   1  img_19.jpg
#>19   0   0   1   1  img_19.jpg
#>19   0   0   1   1  img_19.jpg
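The printed index column shows which original rows were duplicated. As a small illustration, the counts per index can be tallied from the index values listed above; with the real DataFrame, `df_new.index.value_counts()` should give the same tallies.

```python
from collections import Counter

# Index labels of df_new as printed above: the 20 original rows plus 11 added rows.
index_after = list(range(20)) + [19, 19, 13, 13, 13, 19, 19, 19, 19, 19, 19]

counts = Counter(index_after)
duplicated = {idx: n for idx, n in counts.items() if n > 1}
print(duplicated)  # {13: 4, 19: 9}
```

In this run, all 11 added rows are copies of just two samples (rows 13 and 19), the only rows whose labels reduce the imbalance.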

โ„น๏ธ Install

  • Install from GitHub (you may need to install dependencies from requirements.txt first)
pip install git+https://github.com/phiyodr/multilabel-oversampling

👷 Future work

  • Implement weighted sampling (so that samples that already appear often in the new df are drawn less often)
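One possible shape for such weighted sampling is sketched below. It is not part of the package: the helper `draw_weighted` and the weighting rule 1 / (1 + times added) are hypothetical choices made only to illustrate the idea, using the counts from the example run (row 13 added 3 times, row 19 added 8 times).

```python
import random
from collections import Counter


def draw_weighted(indices, times_added, rng):
    """Pick one index; rows that were already added often get a smaller weight."""
    # Hypothetical weighting rule: 1 / (1 + number of times the row was added).
    weights = [1.0 / (1 + times_added[i]) for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


rng = random.Random(20)
indices = list(range(20))
# How often each row was added in the example run (rows 13 and 19 only).
times_added = Counter({13: 3, 19: 8})

# Simulate many draws to see how the weights shift the sampling frequencies.
draws = Counter(draw_weighted(indices, times_added, rng) for _ in range(20000))
print(draws[0], draws[13], draws[19])
```

With this rule, a row that was never added is drawn most often, and row 19 least often, so the upsampled data would lean less heavily on a single sample.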

🌻

multilabel-oversampling's Issues

Issue with Installing multilabel-oversampling

Hello,

I hope this message finds you well. I recently encountered an issue while trying to install the multilabel-oversampling package directly from the GitHub repository using pip.

Specifically, during the installation process, I received a ModuleNotFoundError indicating that the numpy module was not found. It seems that the package attempts to import numpy during the build/setup phase before numpy is installed, even though numpy is listed as a dependency in install_requires in setup.py.

Here's the error message I received for reference:
File "/private/var/folders/9y/pzqtcdlx1jv2lcr6z_6sq6jm0000gn/T/pip-req-build-ktfxzxk3/multilabel_oversampling/multilabel_oversampling.py", line 1, in
import numpy as np
ModuleNotFoundError: No module named 'numpy'
[end of output]

To resolve the issue, I tried the following:

  • Listed numpy at the beginning of my requirements.txt file to ensure it's installed first.
  • Manually installed numpy and other dependencies before trying to install multilabel-oversampling.
  • Attempted to comment out the numpy imports temporarily to bypass the error.

While I eventually found a workaround, it would be helpful for other users if this issue could be addressed directly in the repository. Perhaps there's a way to ensure that the package doesn't attempt to import dependencies during the build/setup phase or to handle this in a way that doesn't result in an error.

Thank you for your attention to this matter, and I appreciate the work you've put into the multilabel-oversampling package!

Best regards,
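A common cause of this kind of error is a setup.py that imports the package itself during the build, for example to read a version string, which in turn triggers `import numpy`. Whether that is what happens in this repository is an assumption. If it is, one way to avoid the import is to read the version as plain text, as in the sketch below. The helper name `read_version`, the `__version__` variable, and the `__init__.py` location are hypothetical.

```python
import re
from pathlib import Path


def read_version(init_path):
    """Extract __version__ from a file as text, without importing the package."""
    text = Path(init_path).read_text(encoding="utf-8")
    match = re.search(r"^__version__\s*=\s*['\"]([^'\"]+)['\"]", text, re.MULTILINE)
    if match is None:
        raise RuntimeError(f"No __version__ string found in {init_path}")
    return match.group(1)


# Hypothetical use inside setup.py (the path is an assumption):
# version = read_version("multilabel_oversampling/__init__.py")
```

Since nothing from the package is imported, numpy does not need to be installed before pip runs the build step.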
