Comments (3)
For numerous oversampling techniques it is certainly possible. Some oversampling algorithms, e.g. SMOTE_PSO, optimize the number of samples being generated, so with these techniques it is up to the algorithm how many minority samples are generated in the end. In many cases, however, one can set the number of samples to be generated through the proportion parameter of the oversampling class.
Namely, let N_min and N_maj denote the number of minority and majority samples, so that their difference is N_maj - N_min. The proportion parameter specifies the number of samples to be generated in terms of this difference: proportion * (N_maj - N_min) samples will be generated. For example, if proportion is set to 1, the class label distribution is equalized, as the number of minority samples will match the number of majority samples after oversampling. If proportion is set to less than 1, fewer samples are generated.
If you want to generate a certain number of samples, for example 10 additional minority samples, you can set the proportion parameter to 10 / (N_maj - N_min).
The proportion parameter is supported by more than 60 oversampling techniques in the smote-variants package.
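The arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: proportion_for_extra_samples is a hypothetical helper (not part of smote-variants), the labels are toy data, and the problem is assumed to be binary with the minority class being the less frequent label.

```python
import numpy as np


def proportion_for_extra_samples(y, n_extra):
    """Hypothetical helper (not part of smote-variants): the proportion
    value that requests n_extra new minority samples, computed as
    n_extra / (N_maj - N_min)."""
    _, counts = np.unique(y, return_counts=True)
    if len(counts) != 2:
        raise ValueError("expected a binary labeling")
    n_min, n_maj = counts.min(), counts.max()
    if n_maj == n_min:
        raise ValueError("classes are already balanced; N_maj - N_min is 0")
    return n_extra / (n_maj - n_min)


# toy labels: 100 majority (0) and 20 minority (1) samples
y = np.array([0] * 100 + [1] * 20)

proportion = proportion_for_extra_samples(y, n_extra=10)
print(proportion)  # 10 / (100 - 20) = 0.125

# number of samples requested by this proportion
print(proportion * (100 - 20))  # 10.0
```

The resulting value would then be passed as the proportion argument of an oversampler that supports it.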
from smote_variants.
Sorry for asking, but the API docs do not note which of the 80+ oversampling techniques support proportion.
- The examples page only notes that it can be used: https://smote-variants.readthedocs.io/en/latest/examples.html
- The oversamplers page does not note which parameters are supported: https://smote-variants.readthedocs.io/en/latest/oversamplers.html
- There are no notes on combining oversamplers and filters: https://smote-variants.readthedocs.io/en/latest/noise_filters.html
You wrote: "If you want to generate a certain number of samples, for example, 10 additional minority samples are desired, then you can set the proportion parameter to 10 / (N_maj - N_min)."
What if I want to generate 5x or 10x the minority samples before using an undersampler?
That's a good point. I have just created a release (0.7.1) with an additional query function, get_proportion_oversamplers, which returns all oversampler classes that have a proportion parameter:
import smote_variants as sv
prop_oversamplers = sv.get_proportion_oversamplers() # list of all oversampler classes with proportion parameters
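For versions without this query function, one generic way to check whether a class accepts a proportion argument is to inspect its constructor signature. The sketch below uses two stand-in classes (WithProportion and WithoutProportion are hypothetical and not part of smote-variants) so that it runs on its own. It assumes the parameter is declared explicitly in the constructor rather than hidden behind **kwargs.

```python
import inspect


class WithProportion:
    """Hypothetical stand-in for an oversampler with a proportion parameter."""

    def __init__(self, proportion=1.0, n_neighbors=5):
        self.proportion = proportion
        self.n_neighbors = n_neighbors


class WithoutProportion:
    """Hypothetical stand-in for an oversampler without that parameter."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors


def accepts_proportion(cls):
    # True if the constructor declares a parameter named "proportion"
    return "proportion" in inspect.signature(cls.__init__).parameters


candidates = [WithProportion, WithoutProportion]
supported = [c.__name__ for c in candidates if accepts_proportion(c)]
print(supported)  # ['WithProportion']
```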
Also please note that even when an oversampler has a proportion parameter, the resulting class ratio might be inaccurate, as some oversampling techniques change the number of majority samples (e.g. by noise filtering). Those which use proportion accurately (i.e. do not change the majority samples) are exactly the ones which are suitable for multiclass oversampling. You can query these by
import smote_variants as sv
# all oversamplers that have a proportion parameter and only extend the set of
# minority samples (leaving the majority samples intact)
extensive_oversamplers = sv.get_multiclass_oversamplers()
Regarding the combination of oversamplers and filters, it is completely up to the user how to combine them. Some oversampling techniques inherently contain a noise filter (like SMOTE_TomekLinks). As these noise filters are used in multiple oversampling techniques, they have been put into a separate module for ease of reuse. However, one can also use them prior to oversampling or on the result of oversampling without any restriction: any pipeline of noise filters and oversampling techniques can be constructed.
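To generate M * N_min new minority samples, i.e. M times the original N_min, one needs to set the proportion parameter to M * N_min / (N_maj - N_min). Note that the minority class then contains (M + 1) * N_min samples in total; if the total should be M * N_min instead, only (M - 1) * N_min new samples are needed, which corresponds to (M - 1) * N_min / (N_maj - N_min).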
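As a worked example of this formula, the sketch below computes the proportion for a toy case with N_min = 20 and N_maj = 1000. Both helper functions are hypothetical and not part of smote-variants. The second one covers the other reading of "5x" (growing the minority class to M times its original size in total), which follows from the same arithmetic but is not stated in the thread.

```python
def proportion_for_new_samples(n_min, n_maj, m):
    """Proportion that requests m * n_min NEW minority samples."""
    return m * n_min / (n_maj - n_min)


def proportion_for_total(n_min, n_maj, m):
    """Proportion that grows the minority class to m * n_min samples in
    total, i.e. requests (m - 1) * n_min new samples."""
    return (m - 1) * n_min / (n_maj - n_min)


n_min, n_maj = 20, 1000  # toy class sizes
diff = n_maj - n_min

p_new = proportion_for_new_samples(n_min, n_maj, m=5)
p_total = proportion_for_total(n_min, n_maj, m=5)

# samples requested: proportion * (N_maj - N_min)
print(round(p_new * diff))    # 100 new samples, 120 minority samples in total
print(round(p_total * diff))  # 80 new samples, 100 minority samples in total
```

Note that the computed value exceeds 1 whenever the requested number of new samples is larger than N_maj - N_min.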