docta-ai / docta
A Doctor for your data
License: Other
Hi,

I saw there are different setups of `sample_size` for `hoc_cfg` and `detect_cfg` in different experiments. Some experiments use the same value for both, e.g. 35000 for both `hoc_cfg` and `detect_cfg` in `docta_cifar10.py`. Others use different values, e.g. 15000 for `hoc_cfg` and 350000 for `detect_cfg` in `docta_rare_pattern_cifar10.ipynb`.

I tried running `docta_cifar10.py` with different `sample_size` values, and I get different results with every setup. I do not quite understand how `sample_size` influences the results, or how I should select it for a given dataset.

Thank you.
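P.S. My working guess (not verified against the docta internals) is that `sample_size` controls how many examples are randomly subsampled for each estimate, which would explain why different values give different results: smaller samples give noisier estimates. A toy illustration of that effect, with a made-up population and statistic:

```python
# Toy sketch only, not docta code: estimate a "noisy fraction" from
# random subsamples of different sizes and compare how much the
# estimates fluctuate.
import random
random.seed(0)

# Hypothetical population: 50,000 binary labels, 20% flagged as "noisy".
population = [1] * 10_000 + [0] * 40_000

def estimate_noise_rate(sample_size):
    """Estimate the noisy fraction from one random subsample."""
    sample = random.sample(population, sample_size)
    return sum(sample) / sample_size

small = [estimate_noise_rate(500) for _ in range(20)]
large = [estimate_noise_rate(20_000) for _ in range(20)]

def spread(xs):
    return max(xs) - min(xs)

print(spread(small), spread(large))  # larger samples give tighter estimates
```

If that guess is right, a larger `sample_size` mostly trades computation for stability, which would explain why results change between setups.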
Hi.

Thank you for this interesting package.

I was looking at the examples in `tools`, more precisely at `docta_rare_pattern.py`, and used it with the following command, following the demo in the notebook `docta_rare_pattern_clothes.ipynb`:

```
%run ./tools/docta_rare_pattern.py --feature_type 'embedding' --suffix 'c1m_subset'
```

In this case the dataset is created using:

```python
dataset = Customize_Image_Folder(root=cfg.data_root, transform=None)
```

Looking at `Customize_Image_Folder()`, given that `transform=None`, the image will be transformed using the following function:
```python
if transform is None:
    self.transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize((0.6959, 0.6537, 0.6371), (0.3113, 0.3192, 0.3214)),
    ])
```
followed up by:

```python
feature_all.append(tmp_feature.permute(1, 2, 0).numpy().astype(np.uint8))
```

But then, in the `docta_rare_pattern.py` file, `pre_processor.encode_feature()` is used to get the image embeddings, and it also customizes the data (performing another normalization step before extracting the embeddings) with:

```python
dataset_list += [CustomizedDataset(feature=self.dataset.feature, label=self.dataset.label, preprocess=preprocess)]
```

where the `preprocess` function is retrieved from `get_encoder()`:

```python
model_embedding, _, preprocess = open_clip.create_model_and_transforms(self.cfg.embedding_model)
```
The CLIP preprocessing function seems to be this one:

```
Compose(
    Resize(size=224, interpolation=bicubic, max_size=None, antialias=warn)
    CenterCrop(size=(224, 224))
    <function _convert_to_rgb at 0x7ff56ae71f30>
    ToTensor()
    Normalize(mean=[0.48145466, 0.4578275, 0.40821073], std=[0.26862954, 0.26130258, 0.27577711])
)
```
I am wondering why the images are normalized two times: once within `Customize_Image_Folder()` and a second time with the preprocessing function from `open_clip`. Are there any steps that I am missing? I cannot fully understand the need for the double normalization and preprocessing. Could you please explain why this strategy is used in the method?

I am greatly thankful for any hints and insights you can provide. Thank you for your time.

Later edit: I also noticed the same thing in `docta_cifar10.py`, where the data is loaded with `Cifar10_noisy(cfg, train=True)`, which has a transform function defined, and then `pre_processor.encode_feature()` is called, which customizes the data with another preprocessing step.

I apologize for the long post.
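P.S. To make the double-normalization concern concrete: composing two `Normalize` steps with different statistics is itself a single affine map, so the second pass is not a no-op, and the result matches neither set of statistics. A small plain-Python sketch of one channel (ignoring the intermediate `uint8` cast, which complicates things further):

```python
# Sketch only: composing two per-channel normalizations. The constants
# are the first channel's mean/std from the two pipelines quoted above.
def normalize(x, mean, std):
    return (x - mean) / std

x = 0.5  # a pixel value in [0, 1]

# First pass: dataset statistics (as in Customize_Image_Folder).
y = normalize(x, 0.6959, 0.3113)
# Second pass: CLIP statistics (as in the open_clip preprocess).
z = normalize(y, 0.48145466, 0.26862954)

# The composition collapses to one affine transform with merged parameters:
merged_mean = 0.6959 + 0.48145466 * 0.3113
merged_std = 0.3113 * 0.26862954
assert abs(z - normalize(x, merged_mean, merged_std)) < 1e-9
print(z)  # far outside the range either Normalize alone would produce
```

So if both normalizations really are applied in sequence, the embeddings are computed from inputs that CLIP's `Normalize` was not calibrated for, which is exactly what I would like to understand.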
Hello, and thank you for your work. While actually using the package, I am confused by some parameters that are not explained in the code, such as `already_2nn` and `num_rounds`. If I want to run Docta on a 43-class dataset with 400k+ samples, do you have recommended parameter settings?
```python
hoc_cfg = dict(
    max_step = 1501,
    T0 = None,
    p0 = None,
    lr = 0.1,
    num_rounds = 50,
    sample_size = 280000,
    already_2nn = False,
    device = 'cuda'
)

detect_cfg = dict(
    num_epoch = 51,
    sample_size = 280000,
    k = 43,
    name = 'simifeat',
    method = 'rank'
)
```
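For reference, here is a small consistency check I run on these settings before launching; the semantics of `k` and `sample_size` are my own assumptions, not from the docs:

```python
# Hypothetical sanity checks (assumed semantics: k = number of classes,
# sample_size = number of examples drawn per estimate).
dataset_size = 400_000
num_classes = 43

hoc_cfg = dict(max_step=1501, T0=None, p0=None, lr=0.1,
               num_rounds=50, sample_size=280_000,
               already_2nn=False, device='cuda')
detect_cfg = dict(num_epoch=51, sample_size=280_000, k=43,
                  name='simifeat', method='rank')

assert detect_cfg['k'] == num_classes, "k should match the class count"
assert hoc_cfg['sample_size'] <= dataset_size
assert detect_cfg['sample_size'] <= dataset_size
```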