
crate's People

Contributors

druvpai, leslietrue, robinwu218, sdbuch, yaodongyu


crate's Issues

Confusion about the Code Implementation

Thanks for your inspiring work! I have several points of confusion about the code implementation.

1. MSSA

1.1

In the paper's setting, the matrix $U$ stands for an orthonormal basis of a Gaussian subspace, but in the code it is implemented as an unconstrained nn.Linear() layer.

1.2

Besides, in the MSSA equation,

$$ MSSA(Z \mid U_{[K]}) = \beta \begin{bmatrix} U_1, \dots, U_K \end{bmatrix} \begin{bmatrix} SSA(Z \mid U_1) \\ \vdots \\ SSA(Z \mid U_K) \end{bmatrix}. $$

There is a $U \cdot \mathrm{SSA}$ product, where the $U$ multiplying from the left is supposed to be the same $U$ used inside SSA. But the code has a separate to_out = nn.Linear implementation, which I think destroys the theoretical property.
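For concreteness, here is a minimal sketch of the tied-weight variant the equation suggests, in which the transpose of the same $U$ that computes SSA also serves as the output projection (the class name, shapes, and defaults are illustrative, not the repo's code):

import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange

class TiedMSSA(nn.Module):
    # one nn.Linear holds the stacked bases U_1, ..., U_K; its transpose is
    # reused as the output projection instead of a separate to_out layer
    def __init__(self, dim, heads=8, dim_head=64, beta=1.0):
        super().__init__()
        self.heads, self.scale, self.beta = heads, dim_head ** -0.5, beta
        self.U = nn.Linear(dim, heads * dim_head, bias=False)

    def forward(self, z):  # z: (batch, n_tokens, dim)
        w = rearrange(self.U(z), 'b n (h d) -> b h n d', h=self.heads)  # U_k^* z per head
        attn = F.softmax(w @ w.transpose(-1, -2) * self.scale, dim=-1)
        ssa = rearrange(attn @ w, 'b h n d -> b n (h d)')  # stacked SSA(Z | U_k)
        return self.beta * F.linear(ssa, self.U.weight.t())  # left-multiply by [U_1, ..., U_K]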

2. ISTA

In the ISTA module there is a dictionary $D$ for which the theory requires $D^* D \approx I$. However, the optimization does not include this constraint, and the implementation does not enforce it either, so I think the property may not hold in practice.
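As an illustration of what enforcing the constraint could look like, one common option (which, as far as I can tell, is not in the repo) is a soft orthogonality penalty added to the training loss:

import torch

def ortho_penalty(D):
    # || D^T D - I ||_F^2; zero exactly when the dictionary satisfies D^T D = I
    gram = D.t() @ D
    eye = torch.eye(gram.shape[0], device=D.device, dtype=D.dtype)
    return ((gram - eye) ** 2).sum()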

I would appreciate a reply! Thanks again!

Taking one further step of whitebox approach

Thank you for open-sourcing such great work! I have some questions about the potential of taking one more step to make the entire CRATE pipeline white-box.

As mentioned in the paper, CRATE is trained in a supervised manner with a cross-entropy loss to update the dictionary and subspace parameters, which makes the learning somewhat more task-dependent than ReduNet. Is it theoretically rigorous to say that I can flatten the $Z$ matrix, whose $N$ columns $z$ are learned token representations whose underlying distribution is a mixture of Gaussians, then input them to ReduNet and obtain a compressed and discriminative representation of the image as a whole? Though I am not sure how the dictionary for the tokens should be updated in this case.

A question about part of the attention code

In the attention code I found an operation named to_out, and I cannot understand what functionality this operation is meant to implement.
The specific code is:
import torch
from torch import nn
from einops import rearrange

class Attention(nn.Module):
    def __init__(self, dim, heads=8, dim_head=64, dropout=0.):
        super().__init__()
        inner_dim = dim_head * heads
        # skip the output projection when a single head already matches dim
        project_out = not (heads == 1 and dim_head == dim)
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.attend = nn.Softmax(dim=-1)
        self.dropout = nn.Dropout(dropout)
        self.qkv = nn.Linear(dim, inner_dim, bias=False)
        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        # one shared projection w plays the roles of queries, keys, and values
        w = rearrange(self.qkv(x), 'b n (h d) -> b h n d', h=self.heads)
        dots = torch.matmul(w, w.transpose(-1, -2)) * self.scale
        attn = self.attend(dots)
        attn = self.dropout(attn)
        out = torch.matmul(attn, w)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)

After forward produces its result, the final output is passed through the to_out() operation, but I could not find the corresponding theoretical justification in the MSSA part. Could you please explain this?

Also, where in the code are the LayerNorm before the MSSA module and the LayerNorm before ISTA implemented? I could not find the corresponding code.
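For context, codebases in the vit-pytorch style (which this one resembles) typically apply those LayerNorms through a wrapper around each block rather than inside the blocks themselves; a minimal sketch of that pattern, not necessarily this repo's exact code:

from torch import nn

class PreNorm(nn.Module):
    # applies LayerNorm before the wrapped block, so the norms "before MSSA"
    # and "before ISTA" would live here rather than inside the blocks
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)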

KeyError:'model'

When I run finetune.py, I run into a problem.
The command is:

python finetune.py --bs 256 --net CRATE_small
--opt adamW --lr 5e-5 --n_epochs 200 --randomaug 1 --data cifar10 --ckpt_dir checkpoint.pth.tar --data_dir cifar-10

The error is:

File "D:\dachuang\CRATE\CRATE-main\finetune.py", line 99, in
net.load_state_dict(torch.load(args.ckpt_dir)['model'])
KeyError: 'model'

I want to know how to solve this problem.
Thanks!
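For debugging, a quick way to see what the checkpoint file actually contains is to load it and print its top-level keys before indexing into 'model' (a generic PyTorch snippet, not from the repo):

import torch

ckpt = torch.load("checkpoint.pth.tar", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # look for 'model', 'state_dict', or similar
else:
    print(type(ckpt))  # the file may already be a raw state_dict

If the file turns out to be a raw state_dict, loading it with net.load_state_dict(ckpt), without the ['model'] index, would be the fix.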

more pretrained weights

I really appreciate your work! Will you release more pretrained weights in the future?

How CRATE differs from a Transformer

Thank you for your work. I have a question that may seem naive: where does your code mainly differ from a classic Transformer?

computing rate reduction in CRATE

@DruvPai Thank you for sharing the code!

I would really appreciate it if you could share the code to compute $R(Z)$, $R^c(Z)$ in this CRATE code base.

Specifically, I want to reproduce the results in Fig. 3 and Fig. 4a, along with $R(Z)$.

Thank you for your help.
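In the meantime, here is a minimal sketch of those quantities under the usual rate-reduction definitions, $R(Z) = \frac{1}{2}\log\det\left(I + \frac{d}{N\epsilon^2} Z Z^*\right)$ plus the class-conditional sum for $R^c(Z)$ (the value of $\epsilon$ and any normalization of $Z$ would need to be matched against the paper):

import torch

def coding_rate(Z, eps=0.5):
    # R(Z) = 1/2 logdet(I + d/(N eps^2) Z Z^T), with Z of shape (d, N)
    d, N = Z.shape
    I = torch.eye(d, device=Z.device, dtype=Z.dtype)
    return 0.5 * torch.logdet(I + (d / (N * eps ** 2)) * Z @ Z.T)

def coding_rate_per_class(Z, labels, eps=0.5):
    # R^c(Z) = sum_k (N_k / 2N) logdet(I + d/(N_k eps^2) Z_k Z_k^T)
    d, N = Z.shape
    total = Z.new_zeros(())
    for k in labels.unique():
        Zk = Z[:, labels == k]
        Nk = Zk.shape[1]
        I = torch.eye(d, device=Z.device, dtype=Z.dtype)
        total = total + (Nk / (2 * N)) * torch.logdet(I + (d / (Nk * eps ** 2)) * Zk @ Zk.T)
    return total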

requirement

Which versions of Python and torch are used?

Linear projection instead of convolution

Hey,

Is there a specific reason why CRATE uses a simple linear layer to create the patch embeddings, instead of the Conv2d that ViT uses? I don't see this mentioned in the paper.
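For what it's worth, the two are mathematically equivalent when the kernel size and stride both equal the patch size, so the choice is mostly stylistic; a quick check of the equivalence (the patch size and dimensions here are illustrative, not the repo's config):

import torch
from torch import nn
from einops import rearrange

p, dim = 16, 192
conv = nn.Conv2d(3, dim, kernel_size=p, stride=p)
lin = nn.Linear(3 * p * p, dim)

# tie the parameters so both modules compute the same map
lin.weight.data = conv.weight.data.reshape(dim, -1)
lin.bias.data = conv.bias.data

x = torch.randn(1, 3, 224, 224)
out_conv = rearrange(conv(x), 'b d h w -> b (h w) d')
patches = rearrange(x, 'b c (h p1) (w p2) -> b (h w) (c p1 p2)', p1=p, p2=p)
print(torch.allclose(out_conv, lin(patches), atol=1e-5))  # True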

Asking for the Figure 13 and 14 code

I really appreciate your work. I would like to draw graphs similar to Figures 13 and 14. Could you provide me with the code?

Pretrained models

Could you release the models pretrained on ImageNet-1k?

Difference between crate-demo.pth and model_best.pth.tar (from CRATE-base)

Thanks for the great work!

I noticed there is a model size difference: ~300MB (crate-demo.pth) vs. ~100MB (retrained model_best.pth.tar).

I retrained the model using the repo code here, but the results are much worse compared with crate-demo.pth. Just wondering what I am missing here.

Thanks.

The white-box explanation of the CLS token

As I read the paper, the interpretation of the MSSA and ISTA components for compression and sparsification of the token representations is quite clear to me. However, I'm not so sure about the role and interpretation of $z^l_{[CLS]}$ in each layer. I don't quite understand how $z^l_{[CLS]}$ affects the compression and sparsification, and how it is transformed in each layer.

Experiment on Diffusion Models

Thanks for your inspiring work! I am especially interested in the structured denoising and structured diffusion sections. However, I could not find the experimental results related to this part in the paper. Will you release them in the future?
