analyticalmindsltd / smote_variants

A collection of 85 minority oversampling techniques (SMOTE) for imbalanced learning with multi-class oversampling and model selection features

Home Page: http://smote-variants.readthedocs.io

License: MIT License

Python 3.27% Makefile 0.01% Shell 0.01% Jupyter Notebook 96.73%

smote oversampling imbalanced-learning imbalanced-data

smote_variants's Issues

Why do I get this error when I use smote_variants?

This is my code:

vectorCount = CountVectorizer(tokenizer=tokenize)
X_trainCount = vectorCount.fit_transform(X_train)

tf_transformer = TfidfTransformer(use_idf=False)
tf_transformer.fit(X_trainCount)
X_trainTF = tf_transformer.transform(X_trainCount)

oversampler= sv.MulticlassOversampling(sv.distance_SMOTE())
X_res, y_res = oversampler.sample(X_trainTF,y_train)

and I get this error:

ValueError: provided out is the wrong size for the reduction

sv.MulticlassOversampling error for getattr() function

I was trying the example from the package documentation. The following code gave me the error 'TypeError: getattr(): attribute name must be string'. Why?

import smote_variants as sv

import sklearn.datasets as datasets

dataset= datasets.load_wine() 

oversampler= sv.MulticlassOversampling(sv.distance_SMOTE) 

X_samp, y_samp= oversampler.sample(dataset['data'], dataset['target'])

I want to apply smote_variants to my own dataset, but it is not working.
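
The message in this report is the one Python's built-in getattr raises whenever its second argument is not a string. A minimal sketch of that mechanism follows, using only the standard library. It assumes (this is not confirmed by the report) that the installed version resolves the oversampler by name, so that passing the class object sv.distance_SMOTE ends up as the attribute name. The namespace and the lookup helper below are hypothetical stand-ins, not part of smote_variants.

```python
import types


# Hypothetical stand-in for an oversampler class exposed by a package.
class distance_SMOTE:
    pass


# Hypothetical stand-in for the package namespace.
namespace = types.SimpleNamespace(distance_SMOTE=distance_SMOTE)


def lookup(ns, name):
    """Resolve an oversampler by name, the way a name-based API might."""
    return getattr(ns, name)


# Passing the name as a string resolves the class.
resolved = lookup(namespace, "distance_SMOTE")

# Passing the class object itself reproduces the TypeError from the report.
try:
    lookup(namespace, distance_SMOTE)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

If this is indeed the cause, the argument type expected by MulticlassOversampling (string name, class, or instance) should be checked against the documentation of the installed version.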

Question

results= sv.evaluate_oversamplers(datasets= datasets,
samplers= sv.get_n_quickest_oversamplers(10),
classifiers= [knn_classifier, dt_classifier],
cache_path= cache_path,
n_jobs= 1,
max_samp_par_comb= 35)
output:
File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\smote_variants_evaluation.py", line 988, in evaluate_oversamplers
sampling_objs = _cache_samplings(folding,

File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\smote_variants_evaluation.py", line 761, in _cache_samplings
sampling_objs = list(reversed(sorted(sampling_objs, key=key)))

File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\smote_variants_evaluation.py", line 748, in key
if (isinstance(x.sampler, ADG) or isinstance(x.sampler, AMSCO) or

NameError: name 'ADG' is not defined

provided out is the wrong size for the reduction

Hello,

When I try to use SMOTE on text data, I get the error "provided out is the wrong size for the reduction". Could you please help me out?

2021-07-26 12:17:30,170:INFO:MulticlassOversampling: Running multiclass oversampling with strategy eq_1_vs_many_successive

ValueError Traceback (most recent call last)
in ()
12
13 # X_samp and y_samp contain the oversampled dataset
---> 14 X_samp, y_samp= oversampler.sample(X,y_train)

6 frames
<array_function internals> in cumsum(*args, **kwargs)

/usr/local/lib/python3.7/dist-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
56
57 try:
---> 58 return bound(*args, **kwds)
59 except TypeError:
60 # A TypeError occurs if the object does have such a method in its

ValueError: provided out is the wrong size for the reduction

overflow encountered in double_scalars

I use kmeans_SMOTE for mixing images, but I run into the warning in the title.

Looking at the source code, I found that avg_min_dist**len(X[0])/min_count may be the cause. For an image, len(X[0]) can be very large. (I reshape an image from [1, 3, 380, 380] to [1, 433200], so len(X[0]) is 433200. As an exponent, that is really large.)

My question is: how can I use this library for mixing images and avoid this issue?
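
The arithmetic behind the report can be sketched with the standard library alone: raising a distance to the power of 433200 exceeds the range of a double, while the logarithm of the same expression stays small. The numeric values of avg_min_dist and min_count below are illustrative assumptions, not taken from the report, and the log-space helper is hypothetical rather than something the library provides.

```python
import math


def log_density_term(avg_min_dist, n_features, min_count):
    """Return log(avg_min_dist ** n_features / min_count) without forming the power."""
    return n_features * math.log(avg_min_dist) - math.log(min_count)


n_features = 3 * 380 * 380   # 433200, the flattened image length from the report
avg_min_dist = 12.5          # illustrative value
min_count = 40               # illustrative value

# The direct power does not fit into a Python float.
try:
    direct = avg_min_dist ** n_features / min_count
except OverflowError:
    direct = math.inf

# The same quantity in log space is an ordinary finite number.
log_value = log_density_term(avg_min_dist, n_features, min_count)
```

With NumPy scalars the same overflow surfaces as a RuntimeWarning and an inf result instead of an exception. Whether the library's computation can be moved to log space is a separate question for the maintainers.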

I want to know about metrics used for measuring in evaluate_oversamplers

I am familiar with AUC (area under the ROC curve), acc (accuracy) and F1 (F1 score), but I am a bit confused by gacc and brier. Could you please provide a description of these metrics?
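
As a rough orientation, the sketch below computes the two quantities under their usual meanings: gacc as the geometric mean of sensitivity and specificity, and brier as the Brier score (mean squared difference between the predicted probability and the 0/1 label). That these are exactly the definitions used by evaluate_oversamplers is an assumption; the function names are illustrative.

```python
import math


def gacc(tp, fn, tn, fp):
    """Geometric mean of sensitivity (tp / p) and specificity (tn / n)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


def brier(probabilities, labels):
    """Mean squared difference between predicted P(class 1) and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)


# Sensitivity 0.8 and specificity 0.9 give sqrt(0.72).
g = gacc(tp=8, fn=2, tn=90, fp=10)

# Squared errors 0.01, 0.04, 0.16 and 0.01 average to 0.055.
b = brier([0.9, 0.2, 0.6, 0.1], [1, 0, 1, 0])
```

Unlike plain accuracy, the geometric mean drops to zero when either class is never predicted correctly, which is why it is popular for imbalanced data.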

GridSearchCV classifier parameters: int vs list

Thank you very much for providing the smote_variants package - an excellent tool!

It seems that the parameters cannot be passed as lists. I have a question regarding parameter tuning: using the logic from the manual, one can continue the grid using integers:

oversampler = ('smote_variants', 'MulticlassOversampling',
{'oversampler': 'MWMOTE', 'oversampler_params': {}})

classifier = ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators':50, 'max_depth': 3, 'min_samples_split': 2})

model= Pipeline([('scale', StandardScaler()), ('clf', sv.classifiers.OversamplingClassifier(oversampler, classifier))])

model

param_grid= {'clf__oversampler':[('smote_variants', 'MWMOTE', {'proportion': 0.5}),
('smote_variants', 'MWMOTE', {'proportion': 1.0}),
('smote_variants', 'MWMOTE', {'proportion': 1.5})],
'clf__classifier':[('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 60}),
('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 1000}),
('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 40}),
('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 10}),
('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 10}),
('sklearn.ensemble', 'RandomForestClassifier', {'max_depth': 9}),
('sklearn.ensemble', 'RandomForestClassifier', {'max_depth': 4}),
('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_split': 9}),
('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_split': 5}),
] }

Yet in this case, each candidate that GridSearchCV evaluates sets only one parameter. Another formulation of the grid sets all the parameters, but only the last value of each is kept, which is most likely not optimal:

param_grid= {'clf__classifier': [('sklearn.ensemble', 'RandomForestClassifier', {
'max_depth': 20, 'max_depth': 7, 'max_depth': 9, 'max_depth': 2},
{'n_estimators': 300, 'n_estimators': 180, 'n_estimators': 25, 'n_estimators': 2},
{'min_samples_split': 3, 'min_samples_split': 19, 'min_samples_split': 2},
{'min_samples_leaf': 3, 'min_samples_leaf': 18, 'min_samples_leaf': 2},
)] }

The parameter requirement is basically a dictionary, but with floats or integers and not lists. Could you please provide additional instructions on passing through the parameters to the grid for fine tuning?

Any kind of hints would be very much appreciated. Thank you in advance!
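
One way to cover every combination with scalar-valued dictionaries is to enumerate them up front and hand the resulting list to the grid. The sketch below does this with itertools.product, keeping the tuple format used in the question. The value lists are illustrative, and whether OversamplingClassifier accepts these tuples unchanged is taken from the question rather than verified here.

```python
from itertools import product

# Values to explore for each RandomForestClassifier parameter (illustrative).
search_space = {
    'n_estimators': [10, 40, 60],
    'max_depth': [4, 9],
    'min_samples_split': [5, 9],
}

names = list(search_space)

# One classifier specification per combination, each with scalar values only.
classifier_candidates = [
    ('sklearn.ensemble', 'RandomForestClassifier', dict(zip(names, values)))
    for values in product(*(search_space[name] for name in names))
]

param_grid = {'clf__classifier': classifier_candidates}

# A dict literal with a repeated key silently keeps only the last value,
# which explains why the second formulation in the question loses settings.
collapsed = {'max_depth': 20, 'max_depth': 7, 'max_depth': 9, 'max_depth': 2}
```

The number of candidates grows multiplicatively with each parameter, so the value lists are best kept short.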

How to vary the "proportion" parameter - MulticlassOversampling class

Hello, I am trying to vary the proportion parameter in the MulticlassOversampling class.

I tried passing it when creating an instance of the class, but it had no effect when executed.

I also tried setting it in the constructor call distance_SMOTE(proportion=0.5).

Neither had any effect, because the MulticlassOversampling class creates examples until each class matches the majority class. However, I want to try different variants and create samples not only up to 100%, but also up to 75% or 50%.
I hope I made myself understood.
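
For reference, the arithmetic usually associated with the proportion parameter of a binary oversampler can be sketched as below. It assumes that proportion is the fraction of the gap between majority and minority counts to be filled with synthetic samples, so 1.0 means full balance. The helper name is illustrative, and the sketch says nothing about how MulticlassOversampling overrides the value internally.

```python
def n_to_sample(proportion, n_maj, n_min):
    """Number of synthetic minority samples under the assumed definition."""
    return max(0, int((n_maj - n_min) * proportion))


n_maj, n_min = 1000, 200

for_half = n_to_sample(0.5, n_maj, n_min)   # fills half of the gap
for_full = n_to_sample(1.0, n_maj, n_min)   # minority reaches the majority count
```

Under this reading, proportion=0.5 leaves the minority class at 600 samples against 1000 majority samples in the example above.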

Can smote_variants deal with 3_class data?

I use the "Selection of the best oversampler" example to deal with 3-class data:

`from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
import smote_variants as sv
import sklearn.datasets as datasets

dataset= datasets.load_breast_cancer()

dataset= {'data': X_array,
'target': y_array,
'name': 'column_3C'}

classifiers = [('sklearn.neighbors', 'KNeighborsClassifier', {}),
('sklearn.tree', 'DecisionTreeClassifier', {})]

oversamplers = sv.queries.get_all_oversamplers(n_quickest=2)

os_params = sv.queries.generate_parameter_combinations(oversamplers,
n_max_comb=2)

# samp_obj and cl_obj contain the oversampling and classifier objects which give the
# best performance together

samp_obj, cl_obj= sv.evaluation.model_selection(dataset=dataset,
oversamplers=os_params,
classifiers=classifiers,
validator_params={'n_splits': 2,
'n_repeats': 1},
n_jobs= 5)

# training the best techniques using the entire dataset

X_samp, y_samp= samp_obj.sample(dataset['data'],
dataset['target'])
cl_obj.fit(X_samp, y_samp)`

but I get an error like this: y_true and y_pred contain different number of classes 3, 2. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [0 1 2]
What should I do?

Add links of datasets as comments in every notebook in examples given.

Remove warnings

How do I use it along sklearn's Pipeline?

This is my use case:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)

oversampler = sv.MulticlassOversampling(sv.distance_SMOTE())
X_train_resamp, y_train_resamp = oversampler.sample(X_train, y_train)

...and I get this error:
TypeError: only integer scalar arrays can be converted to a scalar index

That's because my X is a list of texts, and I would like to leave it that way as opposed to using a tf-idf frequency matrix.

Is it possible to use sklearn's Pipeline to avoid the error? If so, how do I integrate the oversampler into the following Pipeline?

model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

Multiclass oversampling for multi-minority problem

Hi @gykovacs

I have a multiclass (5 classes), multi-minority (2 minority classes) problem at hand, so I employed the approach you demonstrated in the notebook 001_multiclass_oversampling. However, it appears that only one minority class was oversampled. Here's what I get when checking the class distributions before and after oversampling.

Oversampling with smote_varinats: Class distributions:
class distributions before smote_variants
class 0 - samples: 181126
class 1 - samples: 5694
class 2 - samples: 78727
class 3 - samples: 113578
class 4 - samples: 5971
Print Enter to continue:...

2020-05-04 19:20:24,822:INFO:distance_SMOTE: Running sampling via ('distance_SMOTE', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': None}")
class distributions atfer smote_variants
class 0 - samples: 181126
class 1 - samples: 181126
class 2 - samples: 78727
class 3 - samples: 113578
class 4 - samples: 5971
Print Enter to continue:...

I expect the other minority class, class 4, to be resampled as well. What am I missing?
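
A quick way to check whether every class was balanced is to compute each class's shortfall relative to the largest class. The sketch below applies this to the "after" counts quoted in the report. It assumes the intended target is the majority count for every class; the helper name is illustrative.

```python
def shortfall_to_majority(counts):
    """Map each class below the largest class to the number of samples it lacks."""
    majority = max(counts.values())
    return {label: majority - n for label, n in counts.items() if n < majority}


# Class counts after oversampling, copied from the report.
reported_after = {0: 181126, 1: 181126, 2: 78727, 3: 113578, 4: 5971}

remaining = shortfall_to_majority(reported_after)
```

Three classes still fall short here. The quoted log line mentions distance_SMOTE only and shows no MulticlassOversampling message, which may indicate that the binary oversampler was applied directly, though that cannot be confirmed from the report alone.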

Categorical Variables

Thank you for the great work. How can I run the SMOTE variants on data that contains categorical variables?
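
The difficulty with categorical columns can be seen from the classic SMOTE step, which places a synthetic point on the line segment between a minority sample and one of its neighbours. The sketch below shows only that generic interpolation, not any specific variant from the package; the column layout and values are made up for illustration.

```python
import random


def interpolate(x, neighbor, rng):
    """Classic SMOTE step: a random point on the segment between x and neighbor."""
    lam = rng.random()
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]


rng = random.Random(0)

# Columns: [age, colour_code]; colour_code is a label-encoded category (0, 1 or 2).
x = [35.0, 0.0]
neighbor = [41.0, 2.0]

synthetic = interpolate(x, neighbor, rng)
```

The interpolated colour_code lands between the valid codes, so it no longer names a category. This is why interpolation-based oversamplers generally need categorical features to be handled separately.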

Minimum number of rows in a class

I've been using the ADOMS implementation in this package to balance classes for a while with great results. The other day a colleague asked me what the minimum number of rows a class must have is to reasonably oversample it. I told him that there probably wasn't a magic number and the real question was how representative of the true population the sample of instances in the class was. But the problem stuck with me, and after reading through several papers I haven't seen it dealt with. Does anyone have a rough rule or guideline about the smallest number of rows in a class required for oversampling with a SMOTE variant? Thanks!

Support for python 3.11

Do you plan on adding support for python3.11?

Comparison of some SMOTE Variants without considering the entire dataset

@gykovacs Great work! I want to compare some of the variants of SMOTE. I followed your code in smote_variants/examples/003_evaluation_one_dataset.ipynb and also looked at some examples from your paper, but it appears to oversample the entire dataset, which should not be the case: the model must be validated on untouched test data only. Could you please provide code showing how I can compare the AUC obtained with SMOTE, Borderline-SMOTE, AHC, Cluster SMOTE and ADASYN? Thanks in advance.
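
The protocol asked for here (split first, oversample the training part only, evaluate on the untouched test part) can be sketched with the standard library. The oversampler below is plain random duplication, used purely as a stand-in so that the example runs anywhere; it is not SMOTE and not the package's implementation. The data and the fixed test indices are illustrative.

```python
import random


def random_oversample(X, y, rng):
    """Stand-in oversampler: duplicate random minority rows until classes match."""
    counts = {label: y.count(label) for label in set(y)}
    minority = min(counts, key=counts.get)
    majority_count = max(counts.values())
    minority_rows = [row for row, label in zip(X, y) if label == minority]
    extra = [rng.choice(minority_rows)
             for _ in range(majority_count - counts[minority])]
    return X + extra, y + [minority] * len(extra)


rng = random.Random(0)
X = [[float(i)] for i in range(20)]
y = [1 if i % 5 == 0 else 0 for i in range(20)]   # 4 minority, 16 majority rows

# Hold out the test fold first; it is never oversampled.
test_idx = {0, 1, 2, 3, 4}
X_train = [X[i] for i in range(20) if i not in test_idx]
y_train = [y[i] for i in range(20) if i not in test_idx]
X_test = [X[i] for i in sorted(test_idx)]
y_test = [y[i] for i in sorted(test_idx)]

# Only the training fold is balanced.
X_train_res, y_train_res = random_oversample(X_train, y_train, rng)
```

In practice the stand-in would be replaced by the sample(X, y) call of whichever oversampler is being compared, repeated inside each cross-validation fold, with AUC computed on the corresponding test fold.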

Could I apply this package to the time-series raw data?

Hi, I am doing a project that requires feeding time-series sensor data, such as acceleration and angular velocity, directly into a regression-based deep learning model that predicts a movement score for each subject.
However, I noticed that there are quite few subjects within a certain score range, and the accuracy of the model drops when the score of the test subject falls in this range.
I have read the documentation of SMOTE, and it seems that SMOTE-based algorithms are mainly used for augmenting features, not raw time-series data.
Is it possible to apply a SMOTE-based algorithm directly to raw time-series data?
Thank you so much in advance!

Error: Dimension of X_train and y_train is not the same !

I am getting this error when trying to use any sampler from smote_variants. My binary dataset has 30 input features and one output:
X_train is ndarray with shape (227845, 30)
y_train is ndarray with shape (227845, 1)

/usr/local/lib/python3.10/dist-packages/smote_variants/oversampling/_mwmote.py in sampling_algorithm(self, X, y)
498 return self.return_copies(X, y, "Sampling is not needed")
499
--> 500 X_min = X[y == self.min_label]
501
502 nn_params= {**self.nn_params}

IndexError: boolean index did not match indexed array along dimension 1; dimension is 30 but corresponding boolean dimension is 1

Here's a sample of my code:
X_train, X_test, y_train, y_test = split_data(df, 0.2)
import smote_variants as sv
sampler = sv.MWMOTE()
X_resampled, y_resampled = sampler.sample(X_train, y_train)
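
The IndexError in the traceback can be reproduced with NumPy alone: a label array of shape (n, 1) turns y == label into a two-dimensional mask, which NumPy then tries to match against both axes of X. The sketch below uses a small made-up array and shows that flattening the labels to shape (n,) makes the same selection work. That flattening y_train before calling sample resolves the reported case is a reasonable expectation, not something verified against the package.

```python
import numpy as np

X = np.arange(15, dtype=float).reshape(5, 3)
y_column = np.array([[0], [1], [0], [0], [1]])   # shape (5, 1), as in the report

# A (5, 1) boolean mask does not match the (5, 3) array along axis 1.
try:
    X[y_column == 1]
    failed = False
except IndexError:
    failed = True

# Flattening the labels to shape (5,) gives a row mask.
y_flat = y_column.ravel()
X_min = X[y_flat == 1]
```

For a pandas column the equivalent would be to pass the values of a single Series rather than a one-column DataFrame.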

AttributeError: module 'smote_variants' has no attribute 'Borderline_SMOTE1'

Hi,
I have recently been learning SMOTE-related methods. I installed the smote_variants package with pip, and there was no error during installation.

But I got the attribute error when I tried to use the SMOTE and Borderline_SMOTE methods.
Could you please help me with this problem?

Thank you very much!
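
When an attribute is reported missing, it can help to ask the installed module which similarly named attributes it does expose, since class names may differ between versions. The helper below is a generic standard-library sketch with an illustrative name. It is demonstrated on the math module so that it runs without smote_variants installed; for this report one would pass the imported smote_variants module and 'Borderline_SMOTE1' instead.

```python
import difflib
import math


def suggest_attribute(module, wanted, n=3):
    """Return public names in `module` that closely match `wanted`."""
    public = [name for name in dir(module) if not name.startswith('_')]
    return difflib.get_close_matches(wanted, public, n=n, cutoff=0.6)


# Demonstration on the standard math module with a misspelled name.
suggestions = suggest_attribute(math, 'isfinit')
```

An empty result would suggest that the installed version does not ship the class at all, in which case comparing the installed version with the documentation is the next step.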

Citation format

Please give the citation information.

OversamplingClassifier does not work with probability-based metrics

I use a custom scorer from sklearn, created via the make_scorer function. It does not work if needs_proba=True, so metrics like ROC AUC and PR AUC cannot be used with smote_variants.

The error says that OversamplingClassifier does not have a _classes field. Has the latest version been pushed to pip?

How to suppress INFO logging in smote_variants

How can I suppress the INFO log messages in smote_variants? I am using smote_variants in a Jupyter notebook/lab environment and it prints a lot of output:

2020-07-08 16:28:12,711:INFO:SMOTE_ENN: Running sampling via ('SMOTE_ENN', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': None}")
2020-07-08 16:28:12,711:INFO:SMOTE: Running sampling via ('SMOTE', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': <module 'numpy.random' from '/Users/aa/opt/anaconda3/lib/python3.7/site-packages/numpy/random/__init__.py'>}")
2020-07-08 16:28:12,720:INFO:EditedNearestNeighbors: Running noise removal via EditedNearestNeighbors
2020-07-08 16:28:13,544:INFO:SMOTE_ENN: Running sampling via ('SMOTE_ENN', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': None}")
2020-07-08 16:28:13,545:INFO:SMOTE: Running sampling via ('SMOTE', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': <module 'numpy.random' from '/Users/aa/opt/anaconda3/lib/python3.7/site-packages/numpy/random/__init__.py'>}")
...
...
...
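
The lines above have the shape of messages from Python's standard logging module, so raising the level of the emitting logger should hide them. The sketch below assumes that the package logs through a logger named 'smote_variants'; that name is an assumption and should be checked against the installed version.

```python
import logging

# Assumption: the package emits its messages through a logger with this name.
logger = logging.getLogger('smote_variants')

# Hide INFO messages while keeping warnings and errors visible.
logger.setLevel(logging.WARNING)
```

If the logger name turns out to be different, logging.disable(logging.INFO) silences INFO messages for the whole process, at the cost of also hiding INFO output from other libraries.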

Question: Regarding time complexity of Oversamplers and "Noise Filters"

For Scikit Learn some have created tools for demoing latency (model fitting) against error.

The Scitime estimator is useful for some of the algorithms in Scikit-learn but not all

It would be useful to benchmark and measure the time complexity of oversamplers and see which ones are fast (or not) based on size of dataset and log-odds of majority proportions.

Integration with imbalanced-learn

@gykovacs I was wondering if you would be interested in integrating some of the algorithms into imbalanced-learn. It would be really nice to have more variants in imbalanced-learn and to use your benchmark to get a better idea of what to include.

I was wondering if it would also make sense to compare other methods (e.g. under-sampling) to have a big picture of what is actually working globally.

evaluation metrics

I want to calculate the true positives and true negatives of my model, and I am trying to use sklearn.metrics together with some of the techniques provided by smote_variants.
How can I use these metrics with smote_variants? If that is not possible, could you provide the true positive and true negative rates in your module?

from,
Arjun Puri
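
Since oversampling only changes the training data, predictions on a held-out test set can be scored with any metric afterwards. The sketch below derives the true positive and true negative rates from plain label lists using the standard library; the function names and the example labels are illustrative. sklearn.metrics.confusion_matrix yields the same four counts for real predictions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp, fn, tn, fp


def rates(y_true, y_pred, positive=1):
    """Return (true positive rate, true negative rate)."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn), tn / (tn + fp)


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
tpr, tnr = rates(y_true, y_pred)
```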

Implement 'verbose' parameter (feature request)

Hello, I would love to have control over the logging of the oversampling with a 'verbose' (bool) parameter. I find that the logging can end up cluttering the terminal too much, especially for multiclass oversampling.

I forked the repo and tried implementing it to the OverSamplingBase, just to realise that the logger is created at many other levels, so I'm unsure of what would be the cleanest solution

Thanks for this work!

Question: Combining these with Undersampling

SMOTE variants can be used with undersamplers to speed up classification of imbalanced datasets. However, oversampling normally precedes undersampling. Is it possible to generate minority samples so that the minority class stays smaller than the majority class?
scikit-learn-contrib/imbalanced-learn#925

DEAGO : negative values for categorical features inside the data

Hello,
I am working with a dataset that contains both categorical and continuous features.
When I apply DEAGO, it outputs negative values for the categorical features after sampling.
Could someone let me know whether DEAGO does not support categorical features in the dataset?

When using SOMO, why did the two classes not reach a balance, and why did the number of samples not change?

How can smote_variants work with incremental classifiers on large amounts of data?

Hello,
I am currently working with a large, high-dimensional dataset (1459 features and 20 billion instances) and use the partial_fit method to train on it. How can I make the smote_variants library work properly with such classifiers (online classifiers such as sklearn.linear_model.SGDClassifier)?

How do I use this package with image data?

I want to use this package with image data. How should I provide the input? Would converting the images into a numpy array and feeding that in work? I am also having issues with the 'data' and 'target' entries.
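
Other reports on this page pass a two-dimensional feature array (for example shape (227845, 30)) together with a label array, so a common approach is to flatten each image into one row before sampling and to restore the image shape afterwards. The NumPy sketch below shows only that reshaping; the batch size and image dimensions are illustrative, and whether a given oversampler copes with that many features is a separate concern.

```python
import numpy as np

# Small illustrative batch: 4 images, 3 channels, 8 x 8 pixels.
images = np.zeros((4, 3, 8, 8))
labels = np.array([0, 0, 0, 1])   # one label per image

# One row per image, one column per pixel value.
X = images.reshape(len(images), -1)

# Rows (original or synthetic) can be turned back into images the same way.
restored = X.reshape(-1, *images.shape[1:])
```

Here X and labels play the roles of the 'data' and 'target' entries used in the package examples quoted on this page.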