docta-ai / docta
A Doctor for your data
License: Other
Hi,

I saw there are different setups of `sample_size` for `hoc_cfg` and `detect_cfg` in different experiments. Some experiments use the same value for both, e.g. 35000 for both `hoc_cfg` and `detect_cfg` in `docta_cifar10.py`. Others use different values, e.g. 15000 for `hoc_cfg` and 350000 for `detect_cfg` in `docta_rare_pattern_cifar10.ipynb`.

I tried running `docta_cifar10.py` with different `sample_size` values, and I get different results with every setup. I do not quite understand how `sample_size` influences the results, or how I should select it for a given dataset.

Thank you.
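P.S. My working guess (not verified against the docta internals) is that `sample_size` controls how many examples are randomly subsampled for each estimate, which would explain why different values give different results: smaller samples give noisier estimates. A toy illustration of that effect, with a made-up population and statistic:

```python
# Toy sketch only, not docta code: estimate a "noisy fraction" from
# random subsamples of different sizes and compare how much the
# estimates fluctuate.
import random
random.seed(0)

# Hypothetical population: 50,000 binary labels, 20% flagged as "noisy".
population = [1] * 10_000 + [0] * 40_000

def estimate_noise_rate(sample_size):
    """Estimate the noisy fraction from one random subsample."""
    sample = random.sample(population, sample_size)
    return sum(sample) / sample_size

small = [estimate_noise_rate(500) for _ in range(20)]
large = [estimate_noise_rate(20_000) for _ in range(20)]

def spread(xs):
    return max(xs) - min(xs)

print(spread(small), spread(large))  # larger samples give tighter estimates
```

If that guess is right, a larger `sample_size` mostly trades computation for stability, which would explain why results change between setups.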
Hi.

Thank you for this interesting package.

I was looking at the examples in `tools`, more precisely at `docta_rare_pattern.py`, and used it with the following command, following the demo in the notebook `docta_rare_pattern_clothes.ipynb`:

```
%run ./tools/docta_rare_pattern.py --feature_type 'embedding' --suffix 'c1m_subset'
```

In this case the dataset is created using:

```python
dataset = Customize_Image_Folder(root=cfg.data_root, transform=None)
```

Looking at `Customize_Image_Folder()`, given that `transform=None`, the image will be transformed using the following function:
```python
if transform is None:
    self.transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize((0.6959, 0.6537, 0.6371), (0.3113, 0.3192, 0.3214)),
    ])
```
followed up by:

```python
feature_all.append(tmp_feature.permute(1, 2, 0).numpy().astype(np.uint8))
```

But then, in the `docta_rare_pattern.py` file, `pre_processor.encode_feature()` is used to get the image embeddings, and it also customizes the data (performing another normalization step before extracting the embeddings) with:

```python
dataset_list += [CustomizedDataset(feature=self.dataset.feature, label=self.dataset.label, preprocess=preprocess)]
```

where the `preprocess` function is retrieved from `get_encoder()`:

```python
model_embedding, _, preprocess = open_clip.create_model_and_transforms(self.cfg.embedding_model)
```
The CLIP preprocessing function seems to be this one:

```
Compose(
    Resize(size=224, interpolation=bicubic, max_size=None, antialias=warn)
    CenterCrop(size=(224, 224))
    <function _convert_to_rgb at 0x7ff56ae71f30>
    ToTensor()
    Normalize(mean=[0.48145466, 0.4578275, 0.40821073], std=[0.26862954, 0.26130258, 0.27577711])
)
```
I am wondering why the images are normalized two times: once within `Customize_Image_Folder()` and a second time with the preprocessing function from `open_clip`. Are there any steps that I am missing? I cannot fully understand the need for the double normalization and preprocessing. Could you please explain why this strategy is used in the method?

I am greatly thankful for any hints and insights you can provide. Thank you for your time.

Later edit: I also noticed the same thing in `docta_cifar10.py`, where the data is loaded with `Cifar10_noisy(cfg, train=True)`, which has a transform function defined, and then `pre_processor.encode_feature()` is called, which customizes the data with another preprocessing step.

I apologize for the long post.
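P.S. To make the double-normalization concern concrete: composing two `Normalize` steps with different statistics is itself a single affine map, so the second pass is not a no-op, and the result matches neither set of statistics. A small plain-Python sketch of one channel (ignoring the intermediate `uint8` cast, which complicates things further):

```python
# Sketch only: composing two per-channel normalizations. The constants
# are the first channel's mean/std from the two pipelines quoted above.
def normalize(x, mean, std):
    return (x - mean) / std

x = 0.5  # a pixel value in [0, 1]

# First pass: dataset statistics (as in Customize_Image_Folder).
y = normalize(x, 0.6959, 0.3113)
# Second pass: CLIP statistics (as in the open_clip preprocess).
z = normalize(y, 0.48145466, 0.26862954)

# The composition collapses to one affine transform with merged parameters:
merged_mean = 0.6959 + 0.48145466 * 0.3113
merged_std = 0.3113 * 0.26862954
assert abs(z - normalize(x, merged_mean, merged_std)) < 1e-9
print(z)  # far outside the range either Normalize alone would produce
```

So if both normalizations really are applied in sequence, the embeddings are computed from inputs that CLIP's `Normalize` was not calibrated for, which is exactly what I would like to understand.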
Hello, and thank you for your work. While actually using the package, I am confused by some parameters that are not explained in the code, such as `already_2nn` and `num_rounds`. If I want to run Docta on a 43-class dataset with 400k+ samples, do you have recommended parameter settings?
```python
hoc_cfg = dict(
    max_step = 1501,
    T0 = None,
    p0 = None,
    lr = 0.1,
    num_rounds = 50,
    sample_size = 280000,
    already_2nn = False,
    device = 'cuda'
)

detect_cfg = dict(
    num_epoch = 51,
    sample_size = 280000,
    k = 43,
    name = 'simifeat',
    method = 'rank'
)
```
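For reference, here is a small consistency check I run on these settings before launching; the semantics of `k` and `sample_size` are my own assumptions, not from the docs:

```python
# Hypothetical sanity checks (assumed semantics: k = number of classes,
# sample_size = number of examples drawn per estimate).
dataset_size = 400_000
num_classes = 43

hoc_cfg = dict(max_step=1501, T0=None, p0=None, lr=0.1,
               num_rounds=50, sample_size=280_000,
               already_2nn=False, device='cuda')
detect_cfg = dict(num_epoch=51, sample_size=280_000, k=43,
                  name='simifeat', method='rank')

assert detect_cfg['k'] == num_classes, "k should match the class count"
assert hoc_cfg['sample_size'] <= dataset_size
assert detect_cfg['sample_size'] <= dataset_size
```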