
emo's Introduction

Hi there 👋

😉 I am Siyu Ren.

🎓 I am a fifth-year Ph.D. student majoring in CS at Shanghai Jiao Tong University.

🔎 Currently, my research interests include efficient methods for NLP and large language models, as well as techniques for the mechanistic understanding of LLMs.

📚 For my academic publications, please refer to https://drsy.github.io/.

[Image: DRSY's GitHub stats / Most Used Languages]

[Image: profile]

emo's People

Contributors

drsy · eltociear · xufangzhi


emo's Issues

Is it possible for you to cite "Learning with a Wasserstein Loss" in the camera-ready version?

Dear authors:
Thank you for your great work advancing the frontier of language model training.

Learning with a Wasserstein Loss (arxiv)
Learning with a Wasserstein Loss (NeurIPS 2015)

This paper represents the first use of the Wasserstein distance (i.e. earth mover’s distance) as a loss for supervised learning.
It considers the problem of learning to predict a non-negative measure over a finite set.
Language models are essentially learning to predict a non-negative measure over a finite set: the next-token distribution over the vocabulary.
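As a toy illustration (not from either paper's code), the earth mover's distance between two such measures can be computed exactly with the POT library, given a ground cost matrix; the 0/1 cost below is a hypothetical choice:

    import numpy as np
    import ot  # pip install pot

    p = np.array([0.7, 0.2, 0.1])  # target measure over a 3-token "vocabulary"
    q = np.array([0.1, 0.3, 0.6])  # predicted measure
    C = 1.0 - np.eye(3)            # hypothetical ground cost: 0 on the diagonal, 1 elsewhere

    print(ot.emd2(p, q, C))        # optimal transport cost; with 0/1 cost this is 0.6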

In summary, Learning with a Wasserstein Loss applied a similar method to a similar problem: the core idea and technique are the same, and the problem is the same in principle.

Nonetheless, your great work proposes a tractable and effective upper bound for EMD and verifies EMD's effectiveness in language model fine-tuning, which is nontrivial and impressive.

Could you please, by any chance, cite Learning with a Wasserstein Loss in the camera-ready version of the paper? I believe it will help readers find related work.

I originally wanted to post a Public Comment on OpenReview, but comments are closed now 😂

emo_loss becomes NaN

Hello, a quick question. I am fine-tuning Qwen on a downstream task with this loss; the original loss was MLE.
I am using the emo loss from EMO/language_modeling/gpt2.py.

During training, I observe the following:

[screenshot]

and:

[screenshot]

What might be causing this?
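For context, here is a minimal reconstruction of an EMO-style loss with common NaN guards (fp32 upcast, eps in the normalization); the function name and structure are my own assumptions, not the exact code in EMO/language_modeling/gpt2.py:

    import torch
    import torch.nn.functional as F

    def emo_loss_stable(logits, target_ids, embedding, eps=1e-8):
        # Upcast to fp32: softmax and normalization in fp16 are frequent NaN sources.
        logits = logits.float()
        q = F.softmax(logits, dim=-1)                        # predicted measure Q_theta, (B, V)
        p = F.one_hot(target_ids, logits.size(-1)).float()   # empirical measure P, (B, V)
        e = F.normalize(embedding.float(), dim=-1, eps=eps)  # unit-norm rows avoid divide-by-zero
        q_repr = q @ e                                       # E Q_theta, (B, d)
        p_repr = p @ e                                       # E P, (B, d)
        return (1.0 - (q_repr * p_repr).sum(-1)).mean()      # 1 - (E Q_theta)^T (E P)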

Questions about the EMO loss

gt_q = (q_grad * one_hot).detach()  # predicted probability at the ground-truth index, detached from the graph
q_final = q_grad - gt_q             # forward value zeroed at the ground-truth token; gradients flow unchanged

In theory, the loss value should be unchanged after this modification. However, the formula $1 - (EQ_{\theta})^T EP$ gives different results before and after the modification, while the formula $Q_{\theta}^T C P$ gives the same result both times. Presumably $Q_{\theta}$ itself has changed, so its entries no longer sum to 1. Does emo_loss = (1 - torch.sum(p_contextual_repr * q_contextual_repr, dim=-1)) therefore need an extra subtracted term, torch.sum(gt_q * stable_onehot, dim=-1)?

Also, because the diagonal of $1 - E^T E$ is always 0 at the ground-truth token and the gradient is multiplied by this value, the gradient at the ground-truth token is already 0 and no update happens there. Are these two lines of code still necessary?
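A quick toy check of the above (my own reconstruction; shapes and seed are arbitrary): zeroing $Q_{\theta}$ at the ground-truth index leaves $Q_{\theta}^T C P$ unchanged because the diagonal of $C$ is 0, but changes $1 - (EQ_{\theta})^T EP$ because $Q_{\theta}$ no longer sums to 1:

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    V, d, gt = 8, 4, 3
    E = F.normalize(torch.randn(V, d), dim=-1)  # unit-norm embedding rows
    C = 1.0 - E @ E.t()                         # cost matrix with zero diagonal
    q = F.softmax(torch.randn(V), dim=-1)       # predicted distribution Q_theta
    p = F.one_hot(torch.tensor(gt), V).float()  # one-hot target P

    q_mod = q - (q * p).detach()                # the two lines from the issue, fused

    for Q in (q, q_mod):
        print((Q @ C @ p).item(),               # Q^T C P: same for q and q_mod
              (1.0 - (Q @ E) @ (p @ E)).item()) # 1 - (EQ)^T(EP): differs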

Is there a good way to initialize cost matrix when pretraining from-scratch?

Hi, thank you for sharing this work.

I wonder about the applicability of this method to pretraining from scratch. To do that, one needs to build a good initial cost matrix $\mathbb{C} = [C(v_i, v_j)]_{i,j}$. I can come up with a trivial uniform initialization; is there a better way?

If you ever tried this method for pretraining from scratch, it would be really helpful to know the results. Thanks!
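One hypothetical alternative to the uniform initialization, assuming some pretrained token embeddings are available (my own suggestion, not something the paper prescribes): derive the cost from their cosine similarity, $C(v_i, v_j) = 1 - \cos(e_i, e_j)$:

    import torch
    import torch.nn.functional as F

    def cosine_cost_matrix(pretrained_emb):
        # pretrained_emb: (vocab, dim), e.g. from a small pretrained LM or fastText
        e = F.normalize(pretrained_emb, dim=-1)
        return 1.0 - e @ e.t()  # zero diagonal, costs in [0, 2]

    def uniform_cost_matrix(vocab_size):
        # the trivial fallback: cost 0 to keep mass in place, 1 to move it anywhere
        return 1.0 - torch.eye(vocab_size)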

Could this cause mode collapse?

Hello, this paper is very inspiring. From the code, it effectively adds a corrected cosine loss. Could this make the model assign similar weights to tokens with the same meaning? For example, given the prefix "今天是周三,明天是" ("Today is Wednesday, tomorrow is"), the model tends to output "周四" ("Thursday"). Since "星期四" means the same thing as "周四", a model trained with the EMO loss would give "星期四" a logit almost identical to that of "周四". Semantically similar, high-logit tokens would then take up most of the probability mass, pushing the model toward a single output style. Could this be a problem?
@DRSY
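A toy illustration of the mechanism behind this concern (hypothetical numbers, not the paper's setup): with a tied inner-product head, logits = h @ E^T, so two near-duplicate embeddings receive near-identical logits and together absorb most of the probability mass:

    import torch

    h = torch.tensor([1.0, 0.5])             # hidden state
    E = torch.tensor([[0.9, 0.4],            # embedding of "周四"
                      [0.9, 0.41],           # embedding of "星期四" (near-duplicate)
                      [0.1, -0.8]])          # an unrelated token
    logits = h @ E.t()
    print(torch.softmax(logits, dim=-1))     # the first two tokens split ~89% of the mass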
