minqi824 / adbench
Official Implementation of "ADBench: Anomaly Detection Benchmark", NeurIPS 2022.
License: BSD 2-Clause "Simplified" License
May I ask if you have considered making it into an integrated library in the future? Thanks!
Is there going to be a platform so I can evaluate my method on it?
The requirements.txt file restricts the version of PyOD to 1.0.0, but not any of the other libraries. However, the newest versions of scikit-learn and tensorflow throw errors for some models (LODA and DeepSVDD, for example). You should either restrict scikit-learn and tensorflow to previous versions or use the newest version of PyOD. This makes it very annoying to build a requirements.txt for my own project on top of ADBench. It is related to this issue: yzhao062/pyod#406.
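For anyone hitting the same wall, a sketch of pins that reflect the report above; the exact upper bounds are guesses and would need to be verified:

pyod==1.0.0
scikit-learn<1.2   # upper bound is a guess; newer releases reportedly break LODA
tensorflow<2.10    # upper bound is a guess; newer releases reportedly break DeepSVDD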
Hi,
I found that in the code, DataGenerator.generator() cannot generate data properly. The parameters:
Thank you for your assistance.
Bryan
ELKI, which can easily be invoked from the command line (an example invocation follows below), provides many additional algorithms missing from this benchmark, such as:
In other cases, it may be desirable to compare the performance of different implementations:
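For illustration, an ELKI invocation could look like the line below; the jar name and flags follow ELKI's documented KDDCLIApplication interface, but they are written from memory and should be checked against the current release:

# run LOF on a CSV file via ELKI's command-line application (flags assumed, verify against the docs)
java -jar elki-bundle.jar KDDCLIApplication -dbc.in data.csv -algorithm outlier.lof.LOF -lof.k 20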
Error when downloading the model:
PS D:\PyCharm> git clone https://github.com/Minqi824/ADBench.git
Cloning into 'ADBench'...
remote: Enumerating objects: 1074, done.
remote: Counting objects: 100% (189/189), done.
remote: Compressing objects: 100% (94/94), done.
error: RPC failed; curl 18 HTTP/2 stream 5 was reset
error: 995 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
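Not an ADBench-specific fix, but the usual workarounds for this class of clone failure are to enlarge git's HTTP buffer or to clone shallowly and deepen afterwards:

# generic git workarounds for "RPC failed / early EOF" over HTTP
git config --global http.postBuffer 524288000
git clone --depth 1 https://github.com/Minqi824/ADBench.git
cd ADBench
git fetch --unshallow   # optional: restore the full history once the shallow clone succeeds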
Hello guys!
Super amazing job! Thank you.
I have tried the first examples, but some don't run well. Could you help me, please?
Thank you so much.
CODE:
# customized model on ADBench's datasets
from adbench.run import RunPipeline
from adbench.baseline.Customized.run import Customized
# notice that you should specify the corresponding category of your customized AD algorithm
# for example, here we use Logistic Regression as the customized clf, which belongs to the supervised category
# for your own algorithm, you can achieve the same usage as the other baselines by modifying the fit.py, model.py, and run.py files in adbench/baseline/Customized
pipeline = RunPipeline(suffix='ADBench', parallel='supervise', realistic_synthetic_mode=None, noise_type=None)
results = pipeline.run(clf=Customized)
# customized model on customized dataset
import numpy as np
dataset = {}
dataset['X'] = np.random.randn(1000, 20)
dataset['y'] = np.random.choice([0, 1], 1000)
results = pipeline.run(dataset=dataset, clf=Customized)
print(results)
OUTPUT (REPEATED FOR EACH DATASET):
generating duplicate samples for dataset 39_vertebral...
current noise type: None
{'Samples': 1000, 'Features': 6, 'Anomalies': 138, 'Anomalies Ratio(%)': 13.8}
Error in model fitting. Model:Customized, Error: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). <class 'adbench.baseline.Customized.model.LR'> with constructor (self, *args, **kwargs) doesn't follow this convention.
Current experiment parameters: ('39_vertebral', 1.0, 2), model: Customized, metrics: {'aucroc': nan, 'aucpr': nan}, fitting time: None, inference time: None
Python 3.10.11
pyod == 1.0.0
Mac M2, macOS Ventura 13
I found that it probably has to do with how the parameters are fed, but I really don't think this could be the solution in this case:
https://stackoverflow.com/questions/40025406/inherit-from-scikit-learns-lassocv-model
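For what it's worth, a hedged sketch of the kind of change the error message asks for: scikit-learn rejects estimators whose __init__ takes *args/**kwargs, so every parameter must be listed explicitly. The class name mirrors the LR in adbench/baseline/Customized/model.py, but the real file may differ:

from sklearn.linear_model import LogisticRegression

# hypothetical replacement for the LR class in adbench/baseline/Customized/model.py
class LR(LogisticRegression):
    # sklearn's clone()/validation machinery inspects this signature, so it must
    # list explicit keyword arguments instead of (self, *args, **kwargs)
    def __init__(self, C=1.0, max_iter=1000, random_state=None):
        super().__init__(C=C, max_iter=max_iter, random_state=random_state)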
Thank you again for your help
Hi,
I am a little new to anomaly detection, but I was curious about the right way to do cross-validation while using ADBench, since the test and train samples are already split by the DataGenerator. An easy way would be to concatenate the test and train datasets and then put them in the CV loop (see the sketch below), but is there a cleaner way?
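For reference, a minimal sketch of the concatenate-then-CV approach, assuming DataGenerator.generator() returns a dict with 'X_train', 'y_train', 'X_test', 'y_test' keys; the constructor arguments and dataset name shown are assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

from adbench.data_generator import DataGenerator

generator = DataGenerator(dataset='39_vertebral')  # dataset name is just an example
data = generator.generator(la=1.0)

# undo the pre-made split, then let the CV loop re-split
X = np.concatenate([data['X_train'], data['X_test']])
y = np.concatenate([data['y_train'], data['y_test']])

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(roc_auc_score(y[test_idx], clf.predict_proba(X[test_idx])[:, 1]))
print(f'mean AUC-ROC over folds: {np.mean(scores):.3f}')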
Data sets with 50% anomalies are not anomaly detection!
More data sets does not mean more meaningful results, because "garbage in, garbage out".
One of the big problems with current anomaly detection research is that we do not use good data sets to evaluate results; hence everything sometimes works by chance, and little systematic benefit is observable, because the data sets are not properly labeled as anomalies.
I am by now convinced that from most of the commonly used data sets, you cannot draw meaningful conclusions because of unsuitable labeling.
Shall we add a setup.py to ensure that all the dependencies are installed?
I had to install them manually.
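If it helps, a minimal sketch of what such a setup.py could look like; the version number and dependency pins are illustrative assumptions, not the project's actual metadata:

# hypothetical setup.py sketch; name, version, and pins are placeholders
from setuptools import setup, find_packages

setup(
    name='adbench',
    version='0.1.0',
    packages=find_packages(),
    install_requires=[
        'numpy',
        'pandas',
        'scikit-learn',
        'pyod==1.0.0',  # pinned in the project's requirements.txt
    ],
)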
Shall we avoid passing "ratio=sum(self.data['y_test']) / len(self.data['y_test'])"?
Lines 206 to 207 in f3a9e94
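If the concern is just readability, an equivalent form (an assumption about the intent here, and assuming numpy is already imported as np) would be:

ratio = np.mean(self.data['y_test'])  # same value: the mean of a 0/1 array is the anomaly ratio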
Thanks for the great job! I wonder if it's possible to provide the link/source of the dataset so we can know more about them? Thanks a lot.
I am getting errors when running synthetic dependency anomalies for multiple datasets. I found this remark in data_generator.py "# we found that copula function may occur error in some datasets". How did you overcome this issue? The dependency anomalies fail to generate.
ImportError: cannot import name 'DataGenerator' from 'data_generator' (/Users/xxxx/opt/miniconda3/envs/py3.9/lib/python3.9/site-packages/data_generator/__init__.py)
Any suggestion on how to fix it?
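One plausible cause (an assumption, not a confirmed diagnosis): the top-level data_generator package installed from PyPI is an unrelated project, while ADBench's generator ships inside the adbench package, so the import would read:

from adbench.data_generator import DataGenerator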
Great and enormous work!
Do we have a parallel computing setting to process large-scale data? In reality, big-data conditions are more common and more difficult.
This link is broken:
Line 62 in 783cf9f
I don't find any description of the ALOI dataset in the ADBench paper, only a reference link to the paper https://arxiv.org/pdf/1503.01158.pdf, and I can't find the keyword "ALOI" in that paper. Can you give more description of the ALOI dataset?