Comments (8)

gykovacs commented on June 11, 2024

Hi,

thank you! I think there are multiple things to consider here. First, you cannot evaluate an oversampler alone; you always need a classifier trained on the oversampled dataset. The evaluation should be done in a cross-validation manner, that is, you repeatedly split the dataset into a training and a test set, oversample the training set, fit a classifier to the oversampled training set, and predict on the test set. This can be achieved in a couple of lines of code:

import numpy as np
import smote_variants as sv
import imblearn.datasets as imb_datasets

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

# loading an imbalanced benchmark dataset
libras = imb_datasets.fetch_datasets()['libras_move']

X = libras['data']
y = libras['target']

classifier = DecisionTreeClassifier(max_depth=3, random_state=5)

aucs = []

# cross-validation
for train, test in StratifiedKFold(n_splits=5, shuffle=True, random_state=5).split(X, y):
    # splitting
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]

    # oversampling the training set only
    X_train_samp, y_train_samp = sv.SMOTE(n_neighbors=3, random_state=5).sample(X_train, y_train)
    classifier.fit(X_train_samp, y_train_samp)

    # prediction
    y_pred = classifier.predict_proba(X_test)

    # evaluation
    aucs.append(roc_auc_score(y_test, y_pred[:, 1]))

print('AUC', np.mean(aucs))

You can add any number of further classifiers or oversamplers to this evaluation loop. One must be careful that all the oversamplers and classifiers are evaluated on the very same folds of the dataset, otherwise the results are not comparable.
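For illustration, here is a minimal sketch of how the loop above could be extended to compare two oversamplers on exactly the same folds; it reuses X, y and the imports from the snippet above, and the choice of Borderline_SMOTE1 as the second oversampler is just an example:

# comparing two oversamplers on identical folds; X, y and the imports
# come from the snippet above
oversamplers = [sv.SMOTE(n_neighbors=3, random_state=5),
                sv.Borderline_SMOTE1(n_neighbors=3, random_state=5)]

# fix the folds once so every oversampler sees the very same splits
folds = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=5).split(X, y))

for oversampler in oversamplers:
    aucs = []
    for train, test in folds:
        X_train, X_test = X[train], X[test]
        y_train, y_test = y[train], y[test]

        X_samp, y_samp = oversampler.sample(X_train, y_train)

        classifier = DecisionTreeClassifier(max_depth=3, random_state=5)
        classifier.fit(X_samp, y_samp)

        aucs.append(roc_auc_score(y_test, classifier.predict_proba(X_test)[:, 1]))

    print(type(oversampler).__name__, np.mean(aucs))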

On the other hand, one needs to consider that many oversampling techniques have various parameters that can be tuned. So it is usually not enough to evaluate SMOTE with a single parameter setting; it needs to be evaluated with many different parameter settings, again in a cross-validation manner, before one can say that one oversampler works better than another.

Also, classifiers have a bunch of parameters to tune, so in order to carry out a proper comparison one needs to evaluate the oversamplers with many parameter combinations, and the subsequently applied classifiers with many parameter combinations as well.

This is the only way to draw fair and valid conclusions.

Now, if you think this process through, it amounts to a decent number of oversampling and classification jobs to be executed, and each of them needs to be done with proper cross-validation.

You basically have two options. Option #1 is to extend the sample code above to evaluate many oversamplers with many parameter combinations, followed by classifiers with many parameter combinations, in each step of the loop, and then aggregate the results.
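A rough sketch of option #1 is given below, reusing the pre-computed folds, X, y and the imports from the snippets above. The parameter grids are purely illustrative, not recommended settings:

from sklearn.model_selection import ParameterGrid

# illustrative parameter grids, not recommended settings
oversampler_grid = ParameterGrid({'n_neighbors': [3, 5, 7], 'proportion': [0.5, 1.0]})
classifier_grid = ParameterGrid({'max_depth': [3, 5, None]})

results = {}
for os_params in oversampler_grid:
    for clf_params in classifier_grid:
        aucs = []
        # the same pre-computed folds as above, for comparability
        for train, test in folds:
            X_train, X_test = X[train], X[test]
            y_train, y_test = y[train], y[test]

            X_samp, y_samp = sv.SMOTE(random_state=5, **os_params).sample(X_train, y_train)
            clf = DecisionTreeClassifier(random_state=5, **clf_params).fit(X_samp, y_samp)

            aucs.append(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
        results[(tuple(os_params.items()), tuple(clf_params.items()))] = np.mean(aucs)

best = max(results, key=results.get)
print('best combination:', best, 'AUC:', results[best])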

Alternatively, option #2, you can use the evaluate_oversamplers function of the package, exactly as it is used in the sample codes, since evaluate_oversamplers does exactly what I have outlined above. All the scores coming out of evaluate_oversamplers are properly cross-validated.

Just to emphasize: It is incorrect that the evaluate_oversamplers function samples the entire dataset. It repeatedly samples all the cross-validation folds of the dataset, uses the training set for training and the test set for evaluation.

So, as a summary: just like most machine learning code on GitHub, the oversamplers implemented in smote_variants process and sample all the data you feed them. If you want to do the cross-validation by hand, you need to split the dataset yourself, just like in the sample code above. Alternatively, you can use the built-in evaluation functions and carry out all of this work in a single line of code.
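For illustration, a call of evaluate_oversamplers roughly along the lines of the package's sample codes might look like the sketch below; the argument names (datasets, samplers, classifiers, cache_path) are assumptions based on older documentation, so please check the current docs for the exact signature:

# a rough sketch, not verbatim package sample code; the argument names
# are assumptions taken from older documentation
import smote_variants as sv
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

datasets = [{'data': X, 'target': y, 'name': 'libras_move'}]
oversamplers = [sv.SMOTE, sv.Borderline_SMOTE1]
classifiers = [DecisionTreeClassifier(max_depth=3), KNeighborsClassifier(n_neighbors=5)]

results = sv.evaluate_oversamplers(datasets=datasets,
                                   samplers=oversamplers,
                                   classifiers=classifiers,
                                   cache_path='smote_cache')  # hypothetical cache directory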

shwetashrmgithub commented on June 11, 2024

Thanks a ton! It is really helpful. I will try the needed combinations of hyperparameter tuning. But is it correct that the smote_variants package only works when importing datasets from imblearn.datasets?
But Is it correct that smote_variant package is working only when importing dataset from imblearn.datasets??

gykovacs commented on June 11, 2024

No problem. No, that is not correct. smote_variants works with any dataset represented as a matrix of explanatory variables (X) and a vector of corresponding class labels (y).
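For example, a tiny sketch with a synthetic dataset (the make_classification data below is only illustrative):

import numpy as np
import smote_variants as sv
from sklearn.datasets import make_classification

# any imbalanced X, y pair works, regardless of where it comes from
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=5)

X_samp, y_samp = sv.SMOTE(random_state=5).sample(X, y)

print('before:', np.unique(y, return_counts=True))
print('after: ', np.unique(y_samp, return_counts=True))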

shwetashrmgithub commented on June 11, 2024

Ok, thanks. It would be really helpful if you could provide a link to an implementation of SMOTE from scratch, not using any package.

gykovacs commented on June 11, 2024

Well, the point of having packages is exactly to avoid implementing things from scratch. Also, as this is an open source package, you can find the implementation of all the oversampling techniques in it, from scratch. In particular, the SMOTE algorithm is implemented here:

# excerpt from the sample() method of the SMOTE class
X_min = X[y == self.minority_label]

# fitting the nearest neighbors model on the minority samples
n_neigh = min([len(X_min), self.n_neighbors + 1])
nn = NearestNeighbors(n_neighbors=n_neigh, n_jobs=self.n_jobs)
nn.fit(X_min)
dist, ind = nn.kneighbors(X_min)

if num_to_sample == 0:
    return X.copy(), y.copy()

# generating samples by interpolating between minority points and
# randomly chosen minority neighbors
base_indices = self.random_state.choice(list(range(len(X_min))), num_to_sample)
neighbor_indices = self.random_state.choice(list(range(1, n_neigh)), num_to_sample)
X_base = X_min[base_indices]
X_neighbor = X_min[ind[base_indices, neighbor_indices]]
samples = X_base + np.multiply(self.random_state.rand(num_to_sample, 1),
                               X_neighbor - X_base)
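If you would like a completely self-contained version, the core interpolation idea can also be written out in plain numpy. The function below is only an illustrative sketch (it assumes a binary problem and skips the edge cases the package handles), not the package implementation:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X, y, n_to_sample, n_neighbors=5, random_state=None):
    # identify the minority class (a binary problem is assumed)
    rng = np.random.RandomState(random_state)
    labels, counts = np.unique(y, return_counts=True)
    minority_label = labels[np.argmin(counts)]
    X_min = X[y == minority_label]

    # nearest neighbors within the minority class (index 0 is the point itself)
    n_neigh = min(len(X_min), n_neighbors + 1)
    nn = NearestNeighbors(n_neighbors=n_neigh).fit(X_min)
    _, ind = nn.kneighbors(X_min)

    # pick random base points and random neighbors, then interpolate between them
    base_idx = rng.choice(len(X_min), n_to_sample)
    neigh_idx = rng.choice(np.arange(1, n_neigh), n_to_sample)
    X_base = X_min[base_idx]
    X_neigh = X_min[ind[base_idx, neigh_idx]]
    samples = X_base + rng.rand(n_to_sample, 1) * (X_neigh - X_base)

    # append the synthetic minority samples to the original dataset
    return (np.vstack([X, samples]),
            np.hstack([y, np.full(n_to_sample, minority_label)]))

Calling something like smote_sketch(X, y, n_to_sample=100) then returns the augmented feature matrix and label vector.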

gykovacs commented on June 11, 2024

Hi @shwetashrmgithub, can we close this issue?

shwetashrmgithub commented on June 11, 2024

Yeah, sure @gykovacs! But one last thing: I tried to use other metrics like F1 score and precision in the first code you gave, to compare the various oversampling algorithms, but it shows an error. Could you please look into that? Thanks.

gykovacs commented on June 11, 2024

@shwetashrmgithub, well, roc_auc_score is a standard sklearn evaluation function and I think any other metric should work as well. However, care must be taken because other metrics, like sklearn.metrics.f1_score, take class labels as arguments rather than probability scores, so you need to do it this way:

from sklearn.metrics import f1_score

scores = []

# prediction: hard class labels instead of probabilities
y_pred = classifier.predict(X_test)

# evaluation
scores.append(f1_score(y_test, y_pred))
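Putting it together, the cross-validation loop of the first snippet could collect both metrics side by side, for example:

from sklearn.metrics import roc_auc_score, f1_score

aucs, f1s = [], []

for train, test in StratifiedKFold(n_splits=5, shuffle=True, random_state=5).split(X, y):
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]

    X_samp, y_samp = sv.SMOTE(n_neighbors=3, random_state=5).sample(X_train, y_train)
    classifier.fit(X_samp, y_samp)

    # probabilities for AUC, hard class labels for F1
    aucs.append(roc_auc_score(y_test, classifier.predict_proba(X_test)[:, 1]))
    f1s.append(f1_score(y_test, classifier.predict(X_test)))

print('AUC', np.mean(aucs), 'F1', np.mean(f1s))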
