smote_variants's Issues

Why do I get this error when I use smote_variants?

This is my code:

vectorCount = CountVectorizer(tokenizer=tokenize)
X_trainCount = vectorCount.fit_transform(X_train)

tf_transformer = TfidfTransformer(use_idf=False)
tf_transformer.fit(X_trainCount)
X_trainTF = tf_transformer.transform(X_trainCount)

oversampler= sv.MulticlassOversampling(sv.distance_SMOTE())
X_res, y_res = oversampler.sample(X_trainTF,y_train)

and I get this error:

ValueError: provided out is the wrong size for the reduction
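
One possible cause, not confirmed in this thread, is that X_trainTF is a SciPy sparse matrix (which is what TfidfTransformer returns for the sparse output of CountVectorizer), while the oversampler may expect a dense NumPy array. A minimal sketch of converting the input first; to_dense_2d is a hypothetical helper written for this example, not part of smote_variants:

```python
import numpy as np

def to_dense_2d(X):
    """Return X as a dense 2-D float array.

    Objects exposing .toarray() (e.g. SciPy sparse matrices) are densified first.
    """
    if hasattr(X, "toarray"):
        X = X.toarray()
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError(f"expected a 2-D feature matrix, got shape {X.shape}")
    return X

# Hypothetical usage with the variables from the question:
# X_res, y_res = oversampler.sample(to_dense_2d(X_trainTF), np.asarray(y_train))
```

Densifying a large vocabulary can take a lot of memory, so limiting the number of features in the vectorizer beforehand may be necessary.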

sv.MulticlassOversampling error for getattr() function

I was trying the example from the package documentation. The following example gave me the error 'TypeError: getattr(): attribute name must be string'. Why?

import smote_variants as sv

import sklearn.datasets as datasets

dataset= datasets.load_wine() 

oversampler= sv.MulticlassOversampling(sv.distance_SMOTE) 

X_samp, y_samp= oversampler.sample(dataset['data'], dataset['target'])
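
The error message suggests that the class object ends up as the attribute name in a getattr() call, which only accepts a string. A minimal sketch reproducing this with a stand-in namespace (fake_sv and the empty class are hypothetical, used so the snippet runs without the package). Another issue in this list configures MulticlassOversampling with the oversampler name as a string ('oversampler': 'MWMOTE'), so passing the name rather than the class may be what the installed version expects; treat that as an assumption to check against its documentation.

```python
import types

class distance_SMOTE:  # stand-in for the real oversampler class
    pass

# Stand-in for the smote_variants module namespace.
fake_sv = types.SimpleNamespace(distance_SMOTE=distance_SMOTE)

def lookup(namespace, oversampler):
    """Resolve an oversampler by name, as a name-based API would."""
    return getattr(namespace, oversampler)

# Passing the class itself fails, because getattr() needs a string name.
try:
    lookup(fake_sv, fake_sv.distance_SMOTE)
    raised = False
except TypeError:
    raised = True

# Passing the name as a string resolves to the class.
resolved = lookup(fake_sv, "distance_SMOTE")

# Assumed equivalent call with the real package (verify for your version):
# oversampler = sv.MulticlassOversampling(oversampler="distance_SMOTE",
#                                         oversampler_params={})
```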

Question

results= sv.evaluate_oversamplers(datasets= datasets,
samplers= sv.get_n_quickest_oversamplers(10),
classifiers= [knn_classifier, dt_classifier],
cache_path= cache_path,
n_jobs= 1,
max_samp_par_comb= 35)
output:
File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\smote_variants_evaluation.py", line 988, in evaluate_oversamplers
sampling_objs = _cache_samplings(folding,

File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\smote_variants_evaluation.py", line 761, in _cache_samplings
sampling_objs = list(reversed(sorted(sampling_objs, key=key)))

File "C:\Users\Administrator\AppData\Roaming\Python\Python39\site-packages\smote_variants_evaluation.py", line 748, in key
if (isinstance(x.sampler, ADG) or isinstance(x.sampler, AMSCO) or

NameError: name 'ADG' is not defined

provided out is the wrong size for the reduction

Hello,

When I try to use SMOTE on text data, I get the error 'provided out is the wrong size for the reduction'. Could you please help me out?

2021-07-26 12:17:30,170:INFO:MulticlassOversampling: Running multiclass oversampling with strategy eq_1_vs_many_successive

ValueError Traceback (most recent call last)
in ()
12
13 # X_samp and y_samp contain the oversampled dataset
---> 14 X_samp, y_samp= oversampler.sample(X,y_train)

6 frames
<array_function internals> in cumsum(*args, **kwargs)

/usr/local/lib/python3.7/dist-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
56
57 try:
---> 58 return bound(*args, **kwds)
59 except TypeError:
60 # A TypeError occurs if the object does have such a method in its

ValueError: provided out is the wrong size for the reduction

overflow encountered in double_scalars

I use kmeans_SMOTE for mixing images, but I run into the issue shown below.

Looking at the source code, I found that avg_min_dist**len(X[0])/min_count may be the cause. For an image, len(X[0]) can be very large: I reshape an image from [1, 3, 380, 380] to [1, 433200], so len(X[0]) is 433200, which is a really large exponent.

My question is: how can I use this SMOTE library for mixing images and avoid this issue?

[image: screenshot from the original issue]
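
The exponent explains the overflow: any base above 1 raised to the power 433200 exceeds the largest double. A minimal sketch of the arithmetic, with made-up values for avg_min_dist and min_count; log_ratio is a hypothetical helper showing how the same quantity could be handled in log space, not a function of the library:

```python
import math

n_dims = 3 * 380 * 380      # 433200 features per flattened image
avg_min_dist = 1.5          # made-up value for illustration
min_count = 10              # made-up value for illustration

# Direct evaluation overflows a double. Plain Python raises OverflowError;
# NumPy scalars instead warn about the overflow and yield inf.
try:
    ratio = avg_min_dist ** n_dims / min_count
    overflowed = False
except OverflowError:
    overflowed = True

def log_ratio(avg_min_dist, n_dims, min_count):
    """log(avg_min_dist**n_dims / min_count), computed without overflow."""
    return n_dims * math.log(avg_min_dist) - math.log(min_count)

log_value = log_ratio(avg_min_dist, n_dims, min_count)
```

Since the library evaluates the expression internally, a practical workaround on the user side is probably to reduce the dimensionality of the images (for example with PCA or a learned embedding) before oversampling.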

GridSearchCV classifier parameters: int vs list

Thank you very much for providing the smote_variants package - an excellent tool!

It seems that the parameters cannot be passed as lists. I have a question regarding parameter tuning: using the logic from the manual, one can build the grid using integers:

oversampler = ('smote_variants', 'MulticlassOversampling',
{'oversampler': 'MWMOTE', 'oversampler_params': {}})

classifier = ('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators':50, 'max_depth': 3, 'min_samples_split': 2})

model= Pipeline([('scale', StandardScaler()), ('clf', sv.classifiers.OversamplingClassifier(oversampler, classifier))])

model

param_grid= {'clf__oversampler':[('smote_variants', 'MWMOTE', {'proportion': 0.5}),
('smote_variants', 'MWMOTE', {'proportion': 1.0}),
('smote_variants', 'MWMOTE', {'proportion': 1.5})],
'clf__classifier':[('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 60}),
('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 1000}),
('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 40}),
('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 10}),
('sklearn.ensemble', 'RandomForestClassifier', {'n_estimators': 10}),
('sklearn.ensemble', 'RandomForestClassifier', {'max_depth': 9}),
('sklearn.ensemble', 'RandomForestClassifier', {'max_depth': 4}),
('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_split': 9}),
('sklearn.ensemble', 'RandomForestClassifier', {'min_samples_split': 5}),
] }

Yet in this case, each candidate in the GridSearchCV sets only one parameter. Another formulation of the grid includes all parameters, but keeps only the last value of each, which is most likely not optimal:

param_grid= {'clf__classifier': [('sklearn.ensemble', 'RandomForestClassifier', {
'max_depth': 20, 'max_depth': 7, 'max_depth': 9, 'max_depth': 2},
{'n_estimators': 300, 'n_estimators': 180, 'n_estimators': 25, 'n_estimators': 2},
{'min_samples_split': 3, 'min_samples_split': 19, 'min_samples_split': 2},
{'min_samples_leaf': 3, 'min_samples_leaf': 18, 'min_samples_leaf': 2},
)] }

The parameter requirement is basically a dictionary, but with floats or integers and not lists. Could you please provide additional instructions on passing through the parameters to the grid for fine tuning?

Any kind of hints would be very much appreciated. Thank you in advance!
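
One way to cover combinations, given that each grid entry is a (module, class name, parameter dict) tuple as in the snippets above, is to generate one tuple per parameter combination. A minimal sketch using only the standard library; expand_classifier_grid is a hypothetical helper, not part of smote_variants. Note also that a Python dict literal with repeated keys keeps only the last value, which matches the behaviour described for the second formulation.

```python
from itertools import product

def expand_classifier_grid(module, class_name, param_lists):
    """Build one (module, class_name, params) tuple per parameter combination."""
    names = list(param_lists)
    combos = product(*(param_lists[name] for name in names))
    return [(module, class_name, dict(zip(names, values))) for values in combos]

classifier_grid = expand_classifier_grid(
    'sklearn.ensemble', 'RandomForestClassifier',
    {'n_estimators': [10, 40, 60],
     'max_depth': [4, 9],
     'min_samples_split': [5, 9]})

# Assumed usage with the pipeline from the question:
# param_grid = {'clf__classifier': classifier_grid}

# A dict literal with repeated keys silently keeps the last value only.
collapsed = {'max_depth': 20, 'max_depth': 7}
```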

How to vary the "proportion" parameter - MulticlassOversampling class

Hello, I am trying to vary the proportion parameter in the MulticlassOversampling class.

I tried passing it when declaring an instance of the class, but it had no effect when executed.

[image: screenshot from the original issue]

I also tried setting it in the declaration of the oversampler, distance_SMOTE(proportion=0.5).

[image: screenshot from the original issue]

This is because the MulticlassOversampling class creates examples until it matches the majority class. However, I want to try different variants and not only oversample up to 100%, but also to 75% or 50%.
I hope I made myself understood.
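
As a rough illustration of what a proportion below 1.0 would mean, assuming that proportion scales the gap between the majority and minority class sizes (so 1.0 means full balance). n_to_sample is a hypothetical helper written for this sketch, and whether MulticlassOversampling honours a user-supplied proportion is not confirmed here:

```python
def n_to_sample(n_maj, n_min, proportion):
    """Number of synthetic minority samples for a given proportion (assumed rule)."""
    return max(0, int((n_maj - n_min) * proportion))

n_maj, n_min = 1000, 200
for p in (0.5, 0.75, 1.0):
    added = n_to_sample(n_maj, n_min, p)
    # proportion, synthetic samples added, resulting minority class size
    print(p, added, n_min + added)
```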

Can smote_variants deal with 3_class data?

I use the 'Selection of the best oversampler' example to deal with 3-class data:

`from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
import smote_variants as sv
import sklearn.datasets as datasets

dataset= datasets.load_breast_cancer()

dataset= {'data': X_array,
'target': y_array,
'name': 'column_3C'}

classifiers = [('sklearn.neighbors', 'KNeighborsClassifier', {}),
('sklearn.tree', 'DecisionTreeClassifier', {})]

oversamplers = sv.queries.get_all_oversamplers(n_quickest=2)

os_params = sv.queries.generate_parameter_combinations(oversamplers,
n_max_comb=2)

# samp_obj and cl_obj contain the oversampling and classifier objects which give the
# best performance together

samp_obj, cl_obj= sv.evaluation.model_selection(dataset=dataset,
oversamplers=os_params,
classifiers=classifiers,
validator_params={'n_splits': 2,
'n_repeats': 1},
n_jobs= 5)

# training the best techniques using the entire dataset

X_samp, y_samp= samp_obj.sample(dataset['data'],
dataset['target'])
cl_obj.fit(X_samp, y_samp)`

But I get an error like this: y_true and y_pred contain different number of classes 3, 2. Please provide the true labels explicitly through the labels argument. Classes found in y_true: [0 1 2]
What should I do?

How do I use it along sklearn's Pipeline?

This is my use case:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)

oversampler = sv.MulticlassOversampling(sv.distance_SMOTE())
X_train_resamp, y_train_resamp = oversampler.sample(X_train, y_train)

...and I get this error:
TypeError: only integer scalar arrays can be converted to a scalar index

That's because my X is a list of texts, and I would like to leave it that way rather than use a TF-IDF frequency matrix.

Is it possible to use sklearn's Pipeline to avoid the error? If so, how do I integrate the oversampler into the following Pipeline?

model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

Multiclass oversampling for multi-minority problem

Hi @gykovacs

I have a multiclass (5 classes), multi-minority (2 minority classes) problem at hand, so I employed the approach you demonstrated in notebook 001_multiclass_oversampling. However, it appears that only one minority class was oversampled. Here's what I get when checking the class distributions before and after oversampling.

Oversampling with smote_varinats: Class distributions:
class distributions before smote_variants
class 0 - samples: 181126
class 1 - samples: 5694
class 2 - samples: 78727
class 3 - samples: 113578
class 4 - samples: 5971
Print Enter to continue:...

2020-05-04 19:20:24,822:INFO:distance_SMOTE: Running sampling via ('distance_SMOTE', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': None}")
class distributions atfer smote_variants
class 0 - samples: 181126
class 1 - samples: 181126
class 2 - samples: 78727
class 3 - samples: 113578
class 4 - samples: 5971
Print Enter to continue:...

I expected the other minority class, class 4, to be resampled as well. What am I missing?

Categorical Variables

Thank you for the great work. How can I run the SMOTE variants on data that contains categorical variables?

Minimum number of rows in a class

I've been using the ADOMS implementation in this package to balance classes for a while with great results. The other day a colleague asked me what the minimum number of rows a class must have is to reasonably oversample it. I told him that there probably wasn't a magic number and the real question was how representative of the true population the sample of instances in the class was. But the problem stuck with me, and after reading through several papers I haven't seen it dealt with. Does anyone have a rough rule or guideline about the smallest number of rows in a class you require for oversampling with a SMOTE variant? Thanks!

Comparison of some SMOTE Variants without considering the entire dataset

@gykovacs Great work! I want to compare some of the SMOTE variants. I followed your code in smote_variants/examples/003_evaluation_one_dataset.ipynb and also looked at some examples from your paper, but it seems to oversample the entire dataset, which should not be the case, as validation must be done on the original test data only. Could you please provide code showing how I can compare AUC when applying SMOTE, BorderlineSMOTE, AHC, Cluster SMOTE and ADASYN? Thanks in advance.
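
A common way to keep synthetic samples out of the evaluation is to split first and oversample only the training part. A minimal sketch with NumPy; random_oversample is a simple duplication-based stand-in written for this example (it is not SMOTE and not part of smote_variants), and any sampler exposing sample(X, y) could be swapped in at that step:

```python
import numpy as np

def random_oversample(X, y, rng):
    """Stand-in oversampler: duplicate minority rows until classes are balanced."""
    labels, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_parts, y_parts = [X], [y]
    for label, count in zip(labels, counts):
        if count < n_max:
            idx = rng.choice(np.flatnonzero(y == label), size=n_max - count)
            X_parts.append(X[idx])
            y_parts.append(y[idx])
    return np.vstack(X_parts), np.concatenate(y_parts)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)

# Split first: hold out every fifth row as the untouched test set.
test_mask = np.arange(len(y)) % 5 == 0
X_train, y_train = X[~test_mask], y[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]

# Oversample the training part only; the test set keeps its original balance.
X_train_res, y_train_res = random_oversample(X_train, y_train, rng)
```

A classifier would then be fitted on X_train_res, y_train_res and its AUC computed on X_test, y_test, repeating the procedure for each fold and each variant being compared.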

Could I apply this package to the time-series raw data?

Hi, I am working on a project that feeds time-series sensor data, such as acceleration and angular velocity, directly into a regression-based deep learning model to predict a movement score for each subject.
However, I noticed that there are very few subjects within a certain score range, and the accuracy of the model drops when the score of the test subject is in this range.
I have read the documentation of SMOTE and it seems that SMOTE-based algorithms are mainly used for augmenting features, not raw time-series data.
Is it possible to apply SMOTE-based algorithms directly to raw time-series data?
Thank you so much in advance!

Error: Dimension of X_train and y_train is not the same !

I am getting this error when trying to use any sampler from smote_variants. My binary dataset has 30 input features and one output:
X_train is ndarray with shape (227845, 30)
y_train is ndarray with shape (227845, 1)

/usr/local/lib/python3.10/dist-packages/smote_variants/oversampling/_mwmote.py in sampling_algorithm(self, X, y)
498 return self.return_copies(X, y, "Sampling is not needed")
499
--> 500 X_min = X[y == self.min_label]
501
502 nn_params= {**self.nn_params}

IndexError: boolean index did not match indexed array along dimension 1; dimension is 30 but corresponding boolean dimension is 1

Here's a sample of my code:
X_train, X_test, y_train, y_test = split_data(df, 0.2)
import smote_variants as sv
sampler = sv.MWMOTE()
X_resampled, y_resampled = sampler.sample(X_train, y_train)
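
The traceback is consistent with y_train having shape (227845, 1): the boolean mask y == self.min_label is then two-dimensional and cannot be used to select rows of X. A minimal sketch on toy data showing the failure and the effect of flattening the labels with ravel() before sampling:

```python
import numpy as np

X = np.arange(18, dtype=float).reshape(6, 3)
y_column = np.array([[0], [0], [0], [0], [1], [1]])  # shape (6, 1), as in the issue

# A 2-D mask of shape (6, 1) does not match X along the feature axis.
try:
    X[y_column == 1]
    failed = False
except IndexError:
    failed = True

# Flattening the labels to shape (6,) gives a proper row mask.
y_flat = y_column.ravel()
X_min = X[y_flat == 1]

# Assumed fix for the code in the question:
# X_resampled, y_resampled = sampler.sample(X_train, y_train.ravel())
```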

AttributeError: module 'smote_variants' has no attribute 'Borderline_SMOTE1'

Hi,
I have recently been learning SMOTE-related methods. I installed the smote_variants package with pip, and there was no error during installation.

But I get the attribute error when I try to use the SMOTE and Borderline_SMOTE methods.
Could you please help me with this problem?

Thank you very much!

OversamplingClassifier does not work with probability-based metrics

I use a custom scorer from sklearn, created via the make_scorer function. It does not work if needs_proba=True, so metrics like ROC AUC and PR AUC cannot be used with smote_variants.

The error says OversamplingClassifier does not have a _classes field. Has the latest version been pushed to pip?

How to suppress INFO verbose output in smote_variants

How can I suppress the INFO output in smote_variants? I am using smote_variants in a Jupyter notebook/lab environment and it prints a lot of verbiage:

2020-07-08 16:28:12,711:INFO:SMOTE_ENN: Running sampling via ('SMOTE_ENN', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': None}")
2020-07-08 16:28:12,711:INFO:SMOTE: Running sampling via ('SMOTE', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': <module 'numpy.random' from '/Users/aa/opt/anaconda3/lib/python3.7/site-packages/numpy/random/__init__.py'>}")
2020-07-08 16:28:12,720:INFO:EditedNearestNeighbors: Running noise removal via EditedNearestNeighbors
2020-07-08 16:28:13,544:INFO:SMOTE_ENN: Running sampling via ('SMOTE_ENN', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': None}")
2020-07-08 16:28:13,545:INFO:SMOTE: Running sampling via ('SMOTE', "{'proportion': 1.0, 'n_neighbors': 5, 'n_jobs': 1, 'random_state': <module 'numpy.random' from '/Users/aa/opt/anaconda3/lib/python3.7/site-packages/numpy/random/__init__.py'>}")
...
...
...
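
The lines above follow the standard Python logging format, so they can usually be silenced by raising the level of the emitting logger. A minimal sketch, assuming the package logs through a logger named 'smote_variants' (the name is an assumption; check the installed version):

```python
import logging

# Assumed logger name; adjust if the installed version uses a different one.
logger = logging.getLogger('smote_variants')

# Only messages at WARNING level or above will be emitted from now on.
logger.setLevel(logging.WARNING)
```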

Question: Regarding time complexity of Oversamplers and "Noise Filters"

For Scikit Learn some have created tools for demoing latency (model fitting) against error.

The Scitime estimator is useful for some of the algorithms in Scikit-learn but not all

It would be useful to benchmark and measure the time complexity of oversamplers and see which ones are fast (or not) based on size of dataset and log-odds of majority proportions.
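
A minimal sketch of such a benchmark, timing a sampling callable over growing dataset sizes. time_sampler and duplicate_minority are hypothetical names for this example; the stand-in sampler just duplicates minority rows so the snippet runs without the package, and a real oversampler's sample method could be passed instead:

```python
import time
import numpy as np

def time_sampler(sample_fn, sizes, minority_fraction=0.1, n_features=5, seed=0):
    """Return {n_samples: seconds} for sample_fn(X, y) on synthetic data."""
    rng = np.random.default_rng(seed)
    timings = {}
    for n in sizes:
        X = rng.normal(size=(n, n_features))
        y = (rng.random(n) < minority_fraction).astype(int)
        start = time.perf_counter()
        sample_fn(X, y)
        timings[n] = time.perf_counter() - start
    return timings

def duplicate_minority(X, y):
    """Stand-in sampler: append a copy of the minority rows."""
    mask = y == 1
    return np.vstack([X, X[mask]]), np.concatenate([y, y[mask]])

timings = time_sampler(duplicate_minority, sizes=[1_000, 10_000])
```

Varying minority_fraction alongside the sizes would give the dependence on class imbalance mentioned above.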

Integration with imbalanced-learn

@gykovacs I was wondering if you would be interested in integrating some of the algorithms into imbalanced-learn. It would be really nice to have more variants in imbalanced-learn and to use your benchmark to get a better idea of what to include.

I was wondering if it would also make sense to compare other methods (e.g. under-sampling) to have a big picture of what is actually working globally.

evaluation metrics

I want to calculate the true positives and true negatives of my model, using sklearn.metrics together with some of the techniques in smote_variants.
How can I use these metrics with smote_variants? If that is not possible, can you provide true positive and true negative rates in your module?

from,
Arjun Puri
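
Since oversampling only changes the training data, the usual metrics apply unchanged to the predictions of a classifier trained on the oversampled set. A minimal sketch computing true positive and true negative rates from label arrays; tpr_tnr is a hypothetical helper, and the same counts could be read off sklearn.metrics.confusion_matrix:

```python
import numpy as np

def tpr_tnr(y_true, y_pred, positive=1):
    """Return (true positive rate, true negative rate) for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == positive
    tp = np.sum(pos & (y_pred == positive))
    tn = np.sum(~pos & (y_pred != positive))
    return tp / pos.sum(), tn / (~pos).sum()

# Toy labels: 3 positives (2 found), 5 negatives (4 found).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
tpr, tnr = tpr_tnr(y_true, y_pred)
```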

Implement 'verbose' parameter (feature request)

Hello, I would love to have control over the logging of the oversampling with a 'verbose' (bool) parameter. I find that the logging can end up cluttering the terminal too much, especially for multiclass oversampling.

I forked the repo and tried implementing it in OverSamplingBase, only to realise that the logger is created at many other levels, so I'm unsure what the cleanest solution would be.

Thanks for this work!

DEAGO : negative values for categorical features inside the data

Hello,
I am working with a dataset which contains both categorical and continuous features.
When I apply DEAGO, it outputs negative values for the categorical features after sampling.
Could someone let me know whether DEAGO doesn't support categorical features in the dataset?

How do I use this package with image data?

I want to use this package with image data. How should I provide the input? Would converting the images into a NumPy array and feeding that in work? I am also having issues with the 'Data' and 'Target' labels.
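
Other issues in this list pass a 2-D feature array and a 1-D label array (or, for the evaluation functions, a dict with lowercase 'data' and 'target' keys), so a stack of images would need to be flattened to one row per image. A minimal sketch with random data standing in for real images; note that very high-dimensional rows can cause numerical problems in some oversamplers, as another issue here reports for kmeans_SMOTE:

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((8, 3, 32, 32))        # 8 stand-in images, channels first
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# One row per image: shape (n_images, channels * height * width).
X = images.reshape(len(images), -1)

# Dict layout used by the evaluation examples; the name is made up.
dataset = {'data': X, 'target': labels, 'name': 'toy_images'}

# After oversampling, rows can be reshaped back into images.
restored = X.reshape(-1, 3, 32, 32)
```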
