
Comments (10)

gykovacs commented on June 4, 2024

Hi @glemaitre, absolutely, I had been planning for a long time to contact you about this, but you were faster.

I think imbalanced-learn is a fairly mature package, so we definitely shouldn't make smote-variants a dependency of imbalanced-learn; rather, we should select some techniques and translate the code or reimplement it following the super high-quality standards of imbalanced-learn. In my benchmarking, I arrived at 6 methods which finish in the top 3 places on various types of datasets, and I think these 6 should prove useful in various applications: polynom-fit-SMOTE, ProWSyn, SMOTE-IPF, Lee, SMOBD, G-SMOTE. Alternatively, shooting for the top 3, we could go for polynom-fit-SMOTE, ProWSyn and SMOTE-IPF.
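For reference, a minimal sketch of how these six candidates could be exercised through smote-variants before any reimplementation; the class names below are assumed to match the names the smote_variants package exposes for these methods (they may differ slightly between versions), and everything runs with default parameters:

```python
# Minimal sketch: run the shortlisted oversamplers via smote_variants on a toy
# imbalanced dataset. Class names are assumed from the package's naming of the
# methods; smote_variants oversamplers expose a sample(X, y) method.
import numpy as np
from sklearn.datasets import make_classification
import smote_variants as sv

X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                           random_state=42)

candidates = [sv.polynom_fit_SMOTE, sv.ProWSyn, sv.SMOTE_IPF,
              sv.Lee, sv.SMOBD, sv.G_SMOTE]

for oversampler_cls in candidates:
    oversampler = oversampler_cls()
    # sample(X, y) returns the oversampled feature matrix and label vector
    X_samp, y_samp = oversampler.sample(X, y)
    print(oversampler_cls.__name__, np.bincount(y_samp.astype(int)))
```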

I absolutely agree about benchmarking other techniques too; honestly, this would have been my next project on this topic. I can refine and generalize the evaluation framework quickly. I think we should select the scope (the methods of interest) properly, and we could kick off something like this very quickly.

I was also thinking about creating some sort of "super-wrapper" package, which would wrap oversampling, ensemble, and cost-sensitive learning techniques behind a somewhat standardized interface, exactly for the ease of benchmarking and experimentation. The benchmarking framework would fit this super-wrapper package pretty well.
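As a purely hypothetical illustration of what such a standardized interface might look like (none of these names exist in imbalanced-learn or smote-variants today), a small protocol plus thin adapters could be enough to put the different families of techniques behind one face:

```python
# Purely hypothetical sketch of a "super-wrapper" interface; these classes are
# illustrations only, not an existing API of imbalanced-learn or smote_variants.
from typing import Protocol, Tuple
import numpy as np


class ImbalanceHandler(Protocol):
    """Common face for oversamplers, ensembles, and cost-sensitive learners."""

    def fit_resample(self, X: np.ndarray, y: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        """Return a rebalanced (X, y); identity for purely cost-sensitive methods."""
        ...


class SmoteVariantsAdapter:
    """Wraps a smote_variants oversampler behind the common interface."""

    def __init__(self, oversampler):
        self.oversampler = oversampler

    def fit_resample(self, X, y):
        # smote_variants uses sample(X, y) rather than fit_resample(X, y)
        return self.oversampler.sample(X, y)
```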

Any comments are welcome!


glemaitre commented on June 4, 2024

We are absolutely on the same page.

> I have arrived at 6 methods which finish in the top 3 places on various types of datasets

I think that this is the way to go.

On our side, I think that we can become more conservative about including new SMOTE variants. We can first implement them in smote_variants, if not already present, and use the benchmark to decide on inclusion. It will help us a lot on the documentation side, justifying the included models and the way they work. We can always refer to smote_variants for people who want to try more exotic variants.

> I absolutely agree about benchmarking other techniques too; honestly, this would have been my next project on this topic. I can refine and generalize the evaluation framework quickly. I think we should select the scope (the methods of interest) properly, and we could kick off something like this very quickly.

It has always been an objective of @chkoar and myself, but we have lacked the bandwidth lately. Reusing some existing infrastructure would be really useful.

> I was also thinking about creating some sort of "super-wrapper" package, which would wrap oversampling, ensemble, and cost-sensitive learning techniques behind a somewhat standardized interface, exactly for the ease of benchmarking and experimentation. The benchmarking framework would fit this super-wrapper package pretty well.

This would need to be discussed in more detail, but it could be one way to go.

Regarding cost-sensitive methods, we were thinking about including some. In a way, we thought of using imbalanced-learn 1.0.0 as the trigger to reorganise the modules to take the different approaches into account.


gykovacs commented on June 4, 2024

Great! In order to improve the benchmarking, I will try to set up some sort of fully reproducible auto-benchmarking system as a CI/CD job. I feel like this would be the right way to keep the evaluation transparent and fully reproducible. I also think that, in this way, smote-variants can do a good job as an experimentation sandbox behind imblearn.


glemaitre commented on June 4, 2024

Regarding a continuous benchmark, it is really what I had in mind: scikit-learn-contrib/imbalanced-learn#646 (comment)
@chkoar is more interested in implementing all possible methods and letting the user choose. I would at first prefer to reduce the number of samplers. I would consider the first option valid only if we have a good continuous benchmark running and strong documentation referring to it.

How many resources does your benchmark require? How long does it take to run the experiment?


gykovacs commented on June 4, 2024

Well, the experiment I ran and described in the paper took something like 3 weeks on a 32-core AWS instance, involving 85 methods with 35 different parameter settings, 4 classifiers on top of that with 6 different parameter settings for each, and repeated k-fold cross-validation with 5 splits and 3 repeats, all of that over 104 datasets.

EDIT:
Training the classifiers on top of the various oversampling strategies takes 80% of the time.
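For a rough sense of scale (treating the 35 oversampler settings and 6 classifier settings as per-method upper bounds, which is my simplification rather than an exact count), the grid above implies on the order of 10^8 classifier fits:

```python
# Back-of-the-envelope upper bound on the number of classifier fits in the full
# benchmark; 35 and 6 are treated as maxima per method/classifier, not exact counts.
oversamplers = 85
oversampler_settings = 35      # upper bound per oversampler
classifiers = 4
classifier_settings = 6        # upper bound per classifier
folds = 5 * 3                  # 5 splits x 3 repeats
datasets = 104

fits = (oversamplers * oversampler_settings
        * classifiers * classifier_settings
        * folds * datasets)
print(f"{fits:.2e}")           # roughly 1.1e+08 classifier fits
```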

That's clearly too much computational work, but the majority of it was caused by 5-10 "large" datasets and 3-5 very slow, evolutionary oversampling techniques. I think that

  1. reducing the 35 parameter settings to, say, 15,
  2. reducing the classifier parameter combinations to about 3-4,
  3. reducing the datasets to 60-70 small ones,
  4. reducing the number of repeats in the repeated k-fold cross-validation,
  5. and setting some reasonable timeout for each method

could reduce the work to a couple of hours on a 32-64 core instance.
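For concreteness, here is a minimal sketch of one cell of such a reduced protocol (one dataset, one oversampler, one classifier), assuming smote_variants' sample(X, y) interface and the SMOTE_IPF class name; timeouts and parallelisation over the full grid are left out:

```python
# One cell of the reduced protocol: a single dataset / oversampler / classifier
# combination, evaluated with 5-fold CV repeated twice. The SMOTE_IPF class name
# is assumed from smote_variants' naming of the method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
import smote_variants as sv

X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
scores = []
for train_idx, test_idx in cv.split(X, y):
    # oversample only the training fold to avoid leaking synthetic samples
    X_samp, y_samp = sv.SMOTE_IPF().sample(X[train_idx], y[train_idx])
    clf = DecisionTreeClassifier(max_depth=5).fit(X_samp, y_samp)
    scores.append(roc_auc_score(y[test_idx],
                                clf.predict_proba(X[test_idx])[:, 1]))

print(np.mean(scores))
```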


chkoar commented on June 4, 2024

@glemaitre @gykovacs IMHO, the methods that we have to implement or include in imblearn and the method the user will pick are completely unrelated things. We already know that plain SMOTE will do the job. But, since we have the no free lunch theorem, I believe that we should not worry about which are the best ones to include. We could prioritize by the number of citations (I do not want to set a threshold) or some other criterion. For me, we need a benchmark just for the timings, and we should commit to that. imblearn should have the fastest and most accurate (as described in the papers) implementations. That's my two cents.


gykovacs commented on June 4, 2024

@chkoar If we target well-described and established methods (those which appeared in highly cited journals), the number of potential techniques to include will drop to about 20-30. On the other hand, in my experience, these are typically not the best performers on average - but at the same time, "average performance" is always questionable due to the no free lunch theorem.

Seemingly, the question is whether we trust the outcome of a reasonable benchmark. I think it might make sense, as the methods users look for should perform well on the "smooth" problems arising from real classification datasets, and this might be captured well by a benchmark built from such datasets.

One more remark from my experience: usually, less-established, simple methods were found to be robust enough to provide acceptable performance on all datasets. These are usually described in hard-to-access, super short conference papers.


chkoar commented on June 4, 2024

> @chkoar If we target well-described and established methods (those which appeared in highly cited journals), the number of potential techniques to include will drop to about 20-30. On the other hand, in my experience, these are typically not the best performers on average - but at the same time, "average performance" is always questionable due to the no free lunch theorem.

As I said, I didn't mean this for inclusion but for prioritization, so we would not have a bunch of methods initially, which is @glemaitre's concern, if I understood correctly.

> One more remark from my experience: usually, less-established, simple methods were found to be robust enough to provide acceptable performance on all datasets. These are usually described in hard-to-access, super short conference papers.

I totally agree. That's why I do not see a reason to exclude a method from imblearn and keep only the top ones (most cited, best performing across classifiers, etc.). As you said, there will always be a case where a specific over-sampler performs well.

If that were the case, the main scikit-learn package would have only 5 methods. That's my other two cents.


gykovacs commented on June 4, 2024

I did some experimentation with CircleCI; it doesn't seem to be suitable for automated benchmarking on the community subscription plan, as the workload is too much even if only one relatively small dataset is used.

I have also become concerned about my previous idea of using CI/CD for benchmarking. Instead, I can imagine a standalone benchmarking solution which can be installed on any machine, checks out packages and datasets through some quasi-standard benchmarking interfaces, re-runs experiments where the code has changed, and publishes the results on a local web server.

Maintaining such a solution and linking it from any documentation page doesn't seem to be a burden, yet it stays flexible and can easily be moved between cloud providers when needed.

I think my company could even finance an instance like this. The main difference compared to CI/CD is that it would run the benchmarking regularly, not on pull requests or any other hooks.
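To make the idea a bit more tangible, a very rough skeleton of the scheduling logic could look like the following; all names here are hypothetical placeholders rather than existing tools:

```python
# Hypothetical skeleton of a standalone, regularly scheduled benchmark runner:
# re-run the experiments only when the tracked repository has new commits, then
# publish static results for a local web server. REPO_DIR, STATE_FILE,
# run_benchmarks() and publish_results() are placeholders.
import subprocess
import time

REPO_DIR = "smote_variants"          # hypothetical local checkout of the package
STATE_FILE = "last_benchmarked.txt"  # hypothetical file storing the last commit run


def current_commit(repo_dir):
    return subprocess.check_output(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"], text=True).strip()


def run_benchmarks():
    """Placeholder: execute the (reduced) evaluation grid."""


def publish_results():
    """Placeholder: write static HTML reports for the local web server."""


def main_loop(poll_seconds=24 * 3600):
    while True:
        subprocess.run(["git", "-C", REPO_DIR, "pull", "--ff-only"], check=True)
        head = current_commit(REPO_DIR)
        try:
            last = open(STATE_FILE).read().strip()
        except FileNotFoundError:
            last = None
        if head != last:
            run_benchmarks()
            publish_results()
            with open(STATE_FILE, "w") as fh:
                fh.write(head)
        time.sleep(poll_seconds)
```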

Any comments are welcome! Do you have experience with, or anything particular in mind regarding, a proper benchmarking solution?


zoj613 commented on June 4, 2024

@gykovacs Would you be interested in testing your benchmarks on the newer LoRAS and ProWRAS implementations I wrote here: https://github.com/zoj613/pyloras? I do not think they are implemented in either of the two packages.

They do seem promising, at least to my untrained eye.
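For what it's worth, a comparison could start from something like the sketch below, assuming pyloras exposes LORAS and ProWRAS samplers with an imbalanced-learn-style fit_resample(X, y) interface; the exact class names are an assumption on my part:

```python
# Assumed usage of pyloras: LORAS and ProWRAS class names and the
# fit_resample(X, y) interface are assumptions, so check the package docs.
import numpy as np
from sklearn.datasets import make_classification
from pyloras import LORAS, ProWRAS

X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

for sampler in (LORAS(random_state=0), ProWRAS(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, np.bincount(y_res))
```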

