
SimCSE: Simple Contrastive Learning of Sentence Embeddings

This repository contains the code and pre-trained models for our paper SimCSE: Simple Contrastive Learning of Sentence Embeddings.


Overview

We propose a simple contrastive learning framework that works with both unlabeled and labeled data. Unsupervised SimCSE simply takes an input sentence and predicts itself in a contrastive learning framework, with only standard dropout used as noise. Our supervised SimCSE incorporates annotated pairs from NLI datasets into contrastive learning by using entailment pairs as positives and contradiction pairs as hard negatives. The following figure is an illustration of our models.

Getting Started

We provide an easy-to-use sentence embedding tool based on our SimCSE model (see our Wiki for detailed usage). To use the tool, first install the simcse package from PyPI

pip install simcse

Or directly install it from our code

python setup.py install

Note that if you want to enable GPU encoding, you should install the correct version of PyTorch that supports CUDA. See PyTorch official website for instructions.

After installing the package, you can load our model with just two lines of code

from simcse import SimCSE
model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")

See model list for a full list of available models.

Then you can use our model for encoding sentences into embeddings

embeddings = model.encode("A woman is reading.")

Compute the cosine similarities between two groups of sentences

sentences_a = ['A woman is reading.', 'A man is playing a guitar.']
sentences_b = ['He plays guitar.', 'A woman is making a photo.']
similarities = model.similarity(sentences_a, sentences_b)

Or build index for a group of sentences and search among them

sentences = ['A woman is reading.', 'A man is playing a guitar.']
model.build_index(sentences)
results = model.search("He plays guitar.")

We also support faiss, an efficient similarity search library. Just install the package following instructions here and simcse will automatically use faiss for efficient search.

WARNING: We have found that faiss does not support Nvidia AMPERE GPUs (3090 and A100) well. In that case, you should switch to other GPUs or install the CPU version of the faiss package.
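If you need the CPU-only build, it is typically available from PyPI under the name below (this package name is our assumption; follow the official faiss installation instructions for the officially recommended channel, e.g. conda):

pip install faiss-cpu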

We also provide an easy-to-build demo website to show how SimCSE can be used in sentence retrieval. The code is based on DensePhrases' repo and demo (a lot of thanks to the authors of DensePhrases).

Model List

Our released models are listed as follows. You can import these models by using the simcse package or using HuggingFace's Transformers.

Model                                           Avg. STS
princeton-nlp/unsup-simcse-bert-base-uncased    76.25
princeton-nlp/unsup-simcse-bert-large-uncased   78.41
princeton-nlp/unsup-simcse-roberta-base         76.57
princeton-nlp/unsup-simcse-roberta-large        78.90
princeton-nlp/sup-simcse-bert-base-uncased      81.57
princeton-nlp/sup-simcse-bert-large-uncased     82.21
princeton-nlp/sup-simcse-roberta-base           82.52
princeton-nlp/sup-simcse-roberta-large          83.76

Note that the results are slightly better than those reported in the current version of the paper, after adopting a new set of hyperparameters (for the hyperparameters, see the training section).

Naming rules: unsup and sup represent "unsupervised" (trained on Wikipedia corpus) and "supervised" (trained on NLI datasets) respectively.

Use SimCSE with Huggingface

Besides using our provided sentence embedding tool, you can also easily import our models with HuggingFace's transformers:

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Import our models. The package will take care of downloading the models automatically
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
model = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")

# Tokenize input texts
texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))

If you encounter any problem when loading the models directly through HuggingFace's API, you can also download the models manually from the above table and use model = AutoModel.from_pretrained({PATH TO THE DOWNLOADED MODEL}).
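For example, one way to do the manual download is to clone the model repository from the Hugging Face Hub (the clone URL pattern below follows the standard Hub convention and is shown only as a sketch; git-lfs is needed to fetch the weight files):

git lfs install
git clone https://huggingface.co/princeton-nlp/sup-simcse-bert-base-uncased

and then load it from the local folder:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./sup-simcse-bert-base-uncased")
model = AutoModel.from_pretrained("./sup-simcse-bert-base-uncased")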

Train SimCSE

In the following section, we describe how to train a SimCSE model by using our code.

Requirements

First, install PyTorch by following the instructions from the official website. To faithfully reproduce our results, please use version 1.7.1 with the build matching your platform/CUDA version. PyTorch versions higher than 1.7.1 should also work. For example, if you use Linux and CUDA 11 (how to check CUDA version), install PyTorch with the following command,

pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html

If you instead use CUDA <11 or CPU, install PyTorch by the following command,

pip install torch==1.7.1

Then run the following script to install the remaining dependencies,

pip install -r requirements.txt

Evaluation

Our evaluation code for sentence embeddings is based on a modified version of SentEval. It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks. For STS tasks, our evaluation takes the "all" setting and reports Spearman's correlation. See our paper (Appendix B) for evaluation details.

Before evaluation, please download the evaluation datasets by running

cd SentEval/data/downstream/
bash download_dataset.sh

Then, back in the root directory, you can evaluate any transformers-based pre-trained model using our evaluation code. For example,

python evaluation.py \
    --model_name_or_path princeton-nlp/sup-simcse-bert-base-uncased \
    --pooler cls \
    --task_set sts \
    --mode test

which is expected to output the results in a tabular format:

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 75.30 | 84.67 | 80.19 | 85.40 | 80.82 |    84.26     |      80.39      | 81.58 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

Arguments for the evaluation script are as follows,

  • --model_name_or_path: The name or path of a transformers-based pre-trained checkpoint. You can directly use the models in the above table, e.g., princeton-nlp/sup-simcse-bert-base-uncased.
  • --pooler: Pooling method. Now we support
    • cls (default): Use the representation of [CLS] token. A linear+activation layer is applied after the representation (it's in the standard BERT implementation). If you use supervised SimCSE, you should use this option.
    • cls_before_pooler: Use the representation of [CLS] token without the extra linear+activation. If you use unsupervised SimCSE, you should take this option.
    • avg: Average embeddings of the last layer. If you use checkpoints of SBERT/SRoBERTa (paper), you should use this option.
    • avg_top2: Average embeddings of the last two layers.
    • avg_first_last: Average embeddings of the first and last layers. If you use vanilla BERT or RoBERTa, this works the best.
  • --mode: Evaluation mode
    • test (default): The default test mode. To faithfully reproduce our results, you should use this option.
    • dev: Report the development set results. Note that in STS tasks, only STS-B and SICK-R have development sets, so we only report their numbers. It also uses a fast mode for transfer tasks, so the running time is much shorter than in test mode (though the numbers are slightly lower); see the example after this list.
    • fasttest: The same as test, but with a fast mode, so the running time is much shorter and the reported numbers may be slightly lower (only for transfer tasks).
  • --task_set: What set of tasks to evaluate on (if set, it will override --tasks)
    • sts (default): Evaluate on STS tasks, including STS 12~16, STS-B and SICK-R. This is the most commonly-used set of tasks to evaluate the quality of sentence embeddings.
    • transfer: Evaluate on transfer tasks.
    • full: Evaluate on both STS and transfer tasks.
    • na: Manually set tasks by --tasks.
  • --tasks: Specify which dataset(s) to evaluate on. Will be overridden if --task_set is not na. See the code for a full list of tasks.
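
For example, to quickly check an unsupervised checkpoint on the STS development sets only (STS-B and SICK-R, as described for the dev mode above), you could run:

python evaluation.py \
    --model_name_or_path princeton-nlp/unsup-simcse-bert-base-uncased \
    --pooler cls_before_pooler \
    --task_set sts \
    --mode dev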

Training

Data

For unsupervised SimCSE, we sample 1 million sentences from English Wikipedia; for supervised SimCSE, we use the SNLI and MNLI datasets. You can run data/download_wiki.sh and data/download_nli.sh to download the two datasets.

Training scripts

We provide example training scripts for both unsupervised and supervised SimCSE. In run_unsup_example.sh, we provide a single-GPU (or CPU) example for the unsupervised version, and in run_sup_example.sh we give a multi-GPU example for the supervised version. Both scripts call train.py for training. We explain the arguments in the following:

  • --train_file: Training file path. We support "txt" files (one line for one sentence) and "csv" files (2-column: pair data with no hard negative; 3-column: pair data with one corresponding hard negative instance). You can use our provided Wikipedia or NLI data, or you can use your own data with the same format.
  • --model_name_or_path: Pre-trained checkpoints to start with. For now we support BERT-based models (bert-base-uncased, bert-large-uncased, etc.) and RoBERTa-based models (roberta-base, roberta-large, etc.).
  • --temp: Temperature for the contrastive loss.
  • --pooler_type: Pooling method. It's the same as the --pooler argument in the evaluation part.
  • --mlp_only_train: We have found that for unsupervised SimCSE, it works better to train the model with MLP layer but test the model without it. You should use this argument when training unsupervised SimCSE models.
  • --hard_negative_weight: If using hard negatives (i.e., there are 3 columns in the training file), this is the logarithm of the weight. For example, if the weight is 1, then this argument should be set to 0 (the default value); see the sketch after this list.
  • --do_mlm: Whether to use the MLM auxiliary objective. If True:
    • --mlm_weight: Weight for the MLM objective.
    • --mlm_probability: Masking rate for the MLM objective.
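
To make the hard-negative weighting concrete, below is a minimal, self-contained sketch of the logic (shapes and variable names are illustrative only; the actual implementation is in models.py, which is also quoted in the issues further down). The argument is added to the logit of each example's own hard negative before the softmax cross-entropy, so a value of w scales that term by exp(w), and the default 0 corresponds to a multiplicative weight of 1:

import torch
import torch.nn as nn

batch_size, hidden, temp = 4, 8, 0.05
# z1: anchors, z2: positives, z3: hard negatives (random tensors here, just to fix shapes)
z1, z2, z3 = (torch.randn(batch_size, hidden) for _ in range(3))

cos = nn.CosineSimilarity(dim=-1)
cos_sim = cos(z1.unsqueeze(1), z2.unsqueeze(0)) / temp      # (bs, bs) in-batch similarities
z1_z3_cos = cos(z1.unsqueeze(1), z3.unsqueeze(0)) / temp    # (bs, bs) similarities to hard negatives
cos_sim = torch.cat([cos_sim, z1_z3_cos], dim=1)            # (bs, 2*bs) logits

# --hard_negative_weight is the log of the weight: adding it to the logit of
# example i's own hard negative (column bs + i) scales that softmax term by exp(w).
hard_negative_weight = 0.0
weights = torch.zeros_like(cos_sim)
weights[:, batch_size:] += torch.eye(batch_size) * hard_negative_weight
cos_sim = cos_sim + weights

labels = torch.arange(batch_size)                           # the positive for example i is column i
loss = nn.CrossEntropyLoss()(cos_sim, labels)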

All the other arguments are standard Huggingface transformers training arguments. Some of the often-used arguments are --output_dir, --learning_rate, and --per_device_train_batch_size. In our example scripts, we also set the trainer to evaluate the model on the STS-B development set (you need to download the dataset following the evaluation section) and to save the best checkpoint.
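
To make these arguments concrete, the unsupervised example script run_unsup_example.sh (quoted in full in the issues below) boils down to roughly the following single-GPU invocation:

python train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir result/my-unsup-simcse-bert-base-uncased \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --mlp_only_train \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16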

For the results in the paper, we use Nvidia 3090 GPUs with CUDA 11. Using different types of devices or different versions of CUDA/other software may lead to slightly different performance.

Hyperparameters

We use the following hyperparameters for training SimCSE:

                       Unsup. BERT   Unsup. RoBERTa   Sup.
Batch size             64            512              512
Learning rate (base)   3e-5          1e-5             5e-5
Learning rate (large)  1e-5          3e-5             1e-5

Convert models

Our saved checkpoints are slightly different from Huggingface's pre-trained checkpoints. Run python simcse_to_huggingface.py --path {PATH_TO_CHECKPOINT_FOLDER} to convert it. After that, you can evaluate it by our evaluation code or directly use it out of the box.
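
For example, assuming a checkpoint was saved to the output directory used in the unsupervised example script, a typical convert-then-evaluate workflow would be:

python simcse_to_huggingface.py --path result/my-unsup-simcse-bert-base-uncased

python evaluation.py \
    --model_name_or_path result/my-unsup-simcse-bert-base-uncased \
    --pooler cls_before_pooler \
    --task_set sts \
    --mode test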

Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Tianyu ([email protected]) and Xingcheng ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!

Citation

Please cite our paper if you use SimCSE in your work:

@inproceedings{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2021}
}

SimCSE Elsewhere

We thank the community's efforts for extending SimCSE!

simcse's People

Contributors

ak391, bcol23, danqi, gaotianyu1350, huybery, peregilk, uzay-g, voidism, yaoxingcheng


simcse's Issues

about fixed 0.1 dropout rate

SimCSE is an amazing work! I appreciate your hard work.
I have a question: how did you implement the fixed 0.1 dropout in the code?

Question about the alignment-uniform figure.

In the paper, it says "We visualize checkpoints every 10 training steps and the arrows indicate the training direction."
Is the calculation of alignment and uniformity done on a training batch or on some test set?

Huggingface datasets version 1.2.1 raises ConnectionError in the Chinese region

Hi🥰

ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.2.1/datasets/text/text.py

I encountered the connection error above and figured out that Huggingface has released new versions of datasets (datasets>=1.3.0) that make the library work in offline mode, see this.

I solved it by upgrading datasets to version 1.3.0, but I wonder whether this could result in any incompatibility.

Thanks.

Using sentence embeddings for clustering

Dear authors:
Thank you for your outstanding work and for open-sourcing it. I have some questions about the sentence embeddings obtained with the pre-trained SimCSE models:
Are the sentence embeddings produced by SimCSE suitable for clustering tasks? I tried applying a pre-trained SimCSE model to the FewRel dataset and clustered the resulting sentence embeddings directly with the k-means algorithm, but the clustering quality was not good. Is it because the encoded vectors are too similar to each other and not discriminative enough, or are they simply not suitable for text clustering tasks?

Question about dual encoder

In the paper, it is written that the dual encoder framework hurts performance. However, when the dual encoders share parameters, shouldn't it achieve the same performance as SimCSE?

Implementation details of the dropout mask.

First, I appreciate your hard work.

I'm not sure about the "dropout mask" in Chapter 3. In the following words:

feed the same input to the encoder twice by applying different dropout masks

Do you mean: "Simply feed the input to the encoder, and during training the encoder will automatically apply different dropout masks (from nn.Dropout) to the input"? The structure of BERT remains unchanged, and we don't need to manually create the masks, right?

the format of NLI dataset for training

How do you format the input for supervised training with the NLI data?
It is not straightforward to tell from the code, and it is not explicitly described in the paper.
Since your pooler_output is reshaped to (bs, num_sent, hidden), for the triplet {xi, xi_positive, xi_negative}, is it formatted like this?
[CLS] some text tokens for xi [SEP]
[CLS] some text tokens for xi_positive [SEP]
[CLS] some text tokens for xi_negative [SEP]

And do several such triplets form the batch?

thanks

Some questions about the paper

The paper is really interesting. Appreciate your hard work.

I have two questions:

  1. Really good insight about using Alignment and Uniformity (A&U) to compare and analyze different approaches. However, given that different models might have different similarity ranges, are the absolute A&U values of different models really comparable to each other (e.g., in Figure 3)?
  2. What is the training data for SimCSE in each experiment? I did not find the corresponding text (maybe I missed it).

Cannot reproduce the results in the paper

Hello, I tried the run_unsup_example.sh script and could not reproduce the results in your paper.
My environment is: PyTorch 1.8.0 + CUDA 11.1 + NVIDIA V100.
The trainer state is:

{
  "best_metric": 0.7652876377731959,
  "best_model_checkpoint": "result/my-unsup-simcse-bert-base-uncased",
  "epoch": 0.12794268167860798,
  "global_step": 250,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [
    {
      "epoch": 0.06,
      "eval_avg_sts": 0.7227670477928896,
      "eval_sickr_spearman": 0.7040163496401507,
      "eval_stsb_spearman": 0.7415177459456286,
      "step": 125
    },
    {
      "epoch": 0.13,
      "eval_avg_sts": 0.7386178016746645,
      "eval_sickr_spearman": 0.7119479655761333,
      "eval_stsb_spearman": 0.7652876377731959,
      "step": 250
    }
  ],
  "max_steps": 1954,
  "num_train_epochs": 1,
  "total_flos": 0,
  "trial_name": null,
  "trial_params": null
}

The eval results are:

epoch = 1.0
eval_CR = 86.69
eval_MPQA = 87.75
eval_MR = 81.12
eval_MRPC = 73.53
eval_SST2 = 86.58
eval_SUBJ = 99.6
eval_TREC = 82.67
eval_avg_sts = 0.7371936168707977
eval_avg_transfer = 85.41999999999999
eval_sickr_spearman = 0.7082728178184948
eval_stsb_spearman = 0.7661144159231005
The results in your paper, however, are different (screenshot of the paper's numbers omitted).

MLP layer in the models.py

Hi,

I am confused about why the MLP layer in models.py doesn't change the dimension of the vectors (when the pooler type is cls). What is the function of this layer, then?

Question about the alignment loss (eq.2) and uniformity loss (eq.3)

Hi, thank you for your excellent work.
I tried to calculate the alignment loss (Eq. 2) and uniformity loss (Eq. 3) on the STS-B dataset and to reproduce the results in Figure 2.
Referring to the instructions in your paper and the code in https://github.com/SsnL/align_uniform/, I use the following functions to calculate the two losses:

import torch

def align_loss(x, y, alpha=2):
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

and modify the code from https://github.com/princeton-nlp/SimCSE/blob/main/SentEval/senteval/sts.py#L59-L85 to

def run(self, params, batcher):
    results = {}
    all_sys_scores = []
    all_gs_scores = []
    ################# newly added
    all_loss_align = []
    all_loss_uniform = []
    #################
    for dataset in self.datasets:
        sys_scores = []
        input1, input2, gs_scores = self.data[dataset]
        for ii in range(0, len(gs_scores), params.batch_size):
            batch1 = input1[ii:ii + params.batch_size]
            batch2 = input2[ii:ii + params.batch_size]
            batch_gs_scores = gs_scores[ii:ii + params.batch_size]  # newly added

            # we assume get_batch already throws out the faulty ones
            if len(batch1) == len(batch2) and len(batch1) > 0:
                enc1 = batcher(params, batch1)
                enc2 = batcher(params, batch2)
                ################# newly added
                pos_indices = [i for i in range(len(batch_gs_scores)) if batch_gs_scores[i] >= 4.0]
                enc1_pos = enc1[pos_indices]
                enc2_pos = enc2[pos_indices]
                loss_align = align_loss(enc1_pos, enc2_pos)
                loss_uniform = uniform_loss(torch.cat((enc1, enc2), dim=0))
                all_loss_align.append(loss_align)
                all_loss_uniform.append(loss_uniform)
                ################# 

                for kk in range(enc2.shape[0]):
                    sys_score = self.similarity(enc1[kk], enc2[kk])
                    sys_scores.append(sys_score)
        all_sys_scores.extend(sys_scores)
        all_gs_scores.extend(gs_scores)
        results[dataset] = {'pearson': pearsonr(sys_scores, gs_scores),
                            'spearman': spearmanr(sys_scores, gs_scores),
                            'nsamples': len(sys_scores),
                            'align_loss': np.mean(all_loss_align),  # newly added
                            'uniform_loss': np.mean(all_loss_uniform)}  # newly added
        logging.debug('%s : pearson = %.4f, spearman = %.4f, align_loss = %.4f, uniform_loss = %.4f' %
                      (dataset, results[dataset]['pearson'][0],
                       results[dataset]['spearman'][0], results[dataset]['align_loss'],
                       results[dataset]['uniform_loss']))

However, my results for align_loss and uniform_loss (screenshot omitted) are inconsistent with the results in Figure 2 (about 0.35 for align_loss and about -3.4 for uniform_loss).
I wonder if I have made any mistakes? Thanks.

If one entity has multiple mentions, how should this situation be handled?

Thanks for sharing your code! In the paper you use a cross-entropy loss over cosine similarities (equation screenshot omitted), and the code in models.py is:
loss_fct = nn.CrossEntropyLoss()

# Calculate loss with hard negatives
if num_sent == 3:
    # Note that weights are actually logits of weights
    z3_weight = cls.model_args.hard_negative_weight
    weights = torch.tensor(
        [[0.0] * (cos_sim.size(-1) - z1_z3_cos.size(-1)) + [0.0] * i + [z3_weight] + [0.0] * (z1_z3_cos.size(-1) - i - 1) for i in range(z1_z3_cos.size(-1))]
    ).to(cls.device)
    cos_sim = cos_sim + weights

loss = loss_fct(cos_sim, labels)

But if the input is a <mention, entity> pair, one entity may have multiple mentions, so one entity may have multiple mentions labeled 1. How should this situation be handled? Looking forward to your answer.

Question about the calculation of yhat

Hi, I don't understand why we should use np.dot(xx, [1,2,3,4,5]) when calculating yhat. What is that for?

# Training
while not stop_train and self.nepoch <= self.maxepoch:
    self.trainepoch(trainX, trainy, nepoches=50)
    yhat = np.dot(self.predict_proba(devX), r)
    pr = spearmanr(yhat, self.devscores)[0]
    pr = 0 if pr != pr else pr  # if NaN bc std=0
    # early stop on Pearson
    if pr > bestpr:
        bestpr = pr
        bestmodel = copy.deepcopy(self.model)
    elif self.early_stop:
        if early_stop_count >= 3:
            stop_train = True
        early_stop_count += 1
self.model = bestmodel
yhat = np.dot(self.predict_proba(testX), r)
return bestpr, yhat

I cannot reproduce the results following the given training and evaluation files

Hello, thank you for the released code. I ran run_unsup_example.sh without any modification. After one epoch of training, I got the model. Then I evaluated it with:

python evaluation.py \
    --model_name_or_path PATH_TO_MY_MODEL_FOLDER \
    --pooler cls \
    --task_set sts \
    --mode test

I got the following results (screenshot omitted). Furthermore, I tried the uploaded pretrained models, and their accuracy is correct. Is there something wrong with my training?
My environment:
PyTorch 1.8.0
CUDA 11.1
NVIDIA V100

Regarding the loss function calculation in the code

Hi,

Thanks for sharing this very helpful implementation.

I have a question about a difference between the loss design in the paper and in the code. In the code I found:

    loss_fct = nn.CrossEntropyLoss()

    # Calculate loss with hard negatives
    if num_sent == 3:
        # Note that weights are actually logits of weights
        z3_weight = cls.model_args.hard_negative_weight
        weights = torch.tensor(
            [[0.0] * (cos_sim.size(-1) - z1_z3_cos.size(-1)) + [0.0] * i + [z3_weight] + [0.0] * (z1_z3_cos.size(-1) - i - 1) for i in range(z1_z3_cos.size(-1))]
        ).to(cls.device)
        cos_sim = cos_sim + weights

    loss = loss_fct(cos_sim, labels)

This does not show the denominator component of the loss function in Eq. 5 of the paper (i.e., the normalization over all negative samples in the batch). Also, what does the labels vector mean here? It is an arange of values with start=0 and end=cos_sim.size(0); what does this mean when used with the nn.CrossEntropyLoss() loss function?

Cannot reproduce the result~

Hello, and thank you for this useful code! I tried to reproduce the unsupervised BERT+SimCSE results, but failed. My environment setup is as follows:

pytorch=1.7.1
cudatoolkit=11.1
Single RTX 3090
The following script is the training script I used (exactly the same as run_unsup_example.sh).

python train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir result/my-unsup-simcse-bert-base-uncased \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --mlp_only_train \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"
However, there is a RuntimeError when training finishes. I obtained the following evaluation results:

+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 64.28 | 79.15 | 70.99 | 78.38 | 78.26 |    75.62     |      67.58      | 73.47 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

I think the gap (2.8 on average) is too large. Is it because of the error? How can I obtain ~76 results on the STS tasks?

loss is a very small value

Hi, thank you for your excellent work.

I found that the loss is a very small number from the beginning of training.

When my batch size is 4, the loss is about 0.1 at step 1.

I also tried the dropout augmentation method (batch_size=96) on my own dataset. The loss was about 1.8 at the beginning and dropped to about 0.03 within only two steps. Since dropout is used, the two hidden representations of the same doc/sentence are actually very close, so a very small loss is easy to understand; but I am puzzled because such a small loss also means the gradient is very small, which feels a bit strange.

In other words, it seems that the pre-trained BERT model can already distinguish positive and negative cases very well from the beginning, so the loss is very small. Why, then, does this method bring such a large improvement?

I would like to know how the loss changes during your training, to confirm whether the situation I am describing is a case of model collapse.

Any Plans for Multilingual Versions?

This is a great work and has many potential applications.

Are there any plans to support multilingual text? For example, adapting from mBERT?

Thanks,
Peixiang

Error when running run_unsup_example.sh

Hi,
I have followed the steps in the README to set up my environment. When I run evaluation.py everything is fine, but the error below is raised when I try to run run_unsup_example.sh.

Traceback (most recent call last):
  File "train.py", line 581, in <module>
    main()
  File "train.py", line 563, in main
    results = trainer.evaluate(eval_senteval_transfer=True)
  File "/home/coh4ry7z/SimCSE/src/trainers.py", line 129, in evaluate
    results = se.eval(tasks)
  File "./SentEval/senteval/engine.py", line 59, in eval
    self.results = {x: self.eval(x) for x in name}
  File "./SentEval/senteval/engine.py", line 59, in <dictcomp>
    self.results = {x: self.eval(x) for x in name}
  File "./SentEval/senteval/engine.py", line 127, in eval
    self.results = self.evaluation.run(self.params, self.batcher)
  File "./SentEval/senteval/binary.py", line 49, in run
    enc_input = np.vstack(enc_input)
  File "<__array_function__ internals>", line 5, in vstack
  File "/home/coh4ry7z/.local/share/virtualenvs/SimCSE-IAMR1KOs/lib/python3.8/site-packages/numpy/core/shape_base.py", line 283, in vstack
    return _nx.concatenate(arrs, 0)
  File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate

Thanks for your reply!

The Symmetry of loss function

Given a sentence xi, after applying dropout twice we get two representations hi^z1 and hi^z2.
In Equation (4) (Section 3, Unsupervised SimCSE), it seems you only use contrastive_loss(hi^z1, hi^z2) as the training objective. Have you tried using 1/2 * contrastive_loss(hi^z1, hi^z2) + 1/2 * contrastive_loss(hi^z2, hi^z1) (i.e., the loss in the paper "A Simple Framework for Contrastive Learning of Visual Representations")?

A question about the supervised training method

Hello Tianyu,
I'd like to ask: in the supervised method, do you still use two different embeddings with different dropout masks for each input sentence? If so, when computing the loss, my understanding is that the embedding of each premise sentence corresponds to two positive embeddings produced by alignment. Is that right?

SimCSE reproduction results fluctuate

Excellent work!
I take the liberty of offering some thoughts on the details of the experiments.

  1. Different downstream task datasets rely on the MLM task for further pre-training, and the training weight of the MLM task needs to be finely tuned to fit into the multi-task learning framework. Is it possible to provide an ablation experiment that removes DAPT (the MLM task)?
  2. The downstream tasks in the paper are unsupervised inference tasks, and the baselines for comparison are algorithms such as BERT-whitening and BERT-flow. Can the work of SimCSE be defined as a new paradigm for unsupervised data augmentation in NLP? Or will it provide a whole new approach to pre-training?
  3. Based on 2, have you considered the effect of contrastive sentences on fine-tuning in classification, named entity recognition, and reading comprehension tasks?

Thanks for your reply!

Error while loading the dataset

Hello, thanks for releasing the code. I encountered an error while loading the dataset. I ran download_wiki.sh first and then bash ./run_unsup_example.sh. The relevant code in train.py is:

if data_args.train_file is not None:
    data_files["train"] = data_args.train_file
extension = data_args.train_file.split(".")[-1]
if extension == "txt":
    extension = "text"
if extension == "csv":
    datasets = load_dataset(extension, data_files=data_files, cache_dir='./data/', delimiter="\t" if "tsv" in data_args.train_file else ",")
else:
    datasets = load_dataset(extension, data_files=data_files, cache_dir='./data/')

The error is:

Traceback (most recent call last):
  File "train.py", line 582, in <module>
    main()
  File "train.py", line 308, in main
    datasets = load_dataset(extension, data_files=data_files, cache_dir=DATA_DIR)
  File "/home/sysadmin/anaconda3/envs/zjl/lib/python3.6/site-packages/datasets/load.py", line 591, in load_dataset
    path, script_version=script_version, download_config=download_config, download_mode=download_mode, dataset=True
  File "/home/sysadmin/anaconda3/envs/zjl/lib/python3.6/site-packages/datasets/load.py", line 267, in prepare_module
    local_path = cached_path(file_path, download_config=download_config)
  File "/home/sysadmin/anaconda3/envs/zjl/lib/python3.6/site-packages/datasets/utils/file_utils.py", line 343, in cached_path
    max_retries=download_config.max_retries,
  File "/home/sysadmin/anaconda3/envs/zjl/lib/python3.6/site-packages/datasets/utils/file_utils.py", line 617, in get_from_cache
    raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.2.1/datasets/text/text.py

Question about Figure 2 in the paper

Thank you for your hard work.

I have a question.
I am not sure what the point of training sentence embeddings in extreme cases (no dropout and fixed 0.1) is.
I think that the same [CLS] embeddings would be output from BERT and the cosine similarity would always be 1.0.

Question about Figure 4 in the paper

Thanks for your hard work!
I have a question: referring to Figure 4 in the paper, you say "Compared to all the baseline models, both unsupervised and supervised SimCSE better distinguish sentence pairs with different levels of similarities". What is the criterion: the mean or the variance?
Thanks again!

AutoTokenizer.from_pretrained Failed

In the Model List, all models can be successfully loaded via the sample code in "Use SimCSE with Huggingface", except for two models "princeton-nlp/unsup-simcse-roberta-base" and "princeton-nlp/unsup-simcse-roberta-large".

Traceback:
tokenizer = AutoTokenizer.from_pretrained(model_path)
return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
...
with open(merges_file, encoding="utf-8") as merges_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType

Please let me know if anyone has succeeded in this.

Avg pooling and cls results in unsupervised models

Hi, I ran the experiments with avg pooling and cls separately in the unsupervised setting, but I find that the result of cls exceeds that of avg pooling by almost 18.0%.
Could you explain this phenomenon?
Thanks.

Error when training unsupervised SimCSE

When I run run_unsup_example.sh, an error happens just as training is almost finished:

Traceback (most recent call last):
  File "train.py", line 584, in <module>
    main()
  File "train.py", line 548, in main
    train_result = trainer.train(model_path=model_path)
  File "/home/v-nuochen/SimCSE/simcse/trainers.py", line 464, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/transformers/trainer.py", line 1248, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/transformers/trainer.py", line 1277, in compute_loss
    outputs = model(**inputs)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
    return self.gather(outputs, self.output_device)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "<string>", line 6, in __init__
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/transformers/file_utils.py", line 1383, in __post_init__
    for element in iterator:
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 71, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/comm.py", line 230, in gather
    return torch._C._gather(tensors, dim, destination)

RuntimeError: Input tensor at index 7 has invalid shape [2, 2], but expected [2, 9]
100%|█████████████████████████████████████████████████████████████████████████████████▉| 1953/1954 [18:36<00:00, 1.75it/s]

Could you please tell me why?

Shouldn't this be called a distance?

cosine(embeddings[0], embeddings[1]) is the cosine similarity between 0 and 1, so shouldn't 1 - cosine be called the distance between 0 and 1?

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))

load model error

Hello, I tried to load your SimCSE model:
model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")
and got the error: ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

Cannot reproduce results

Hello, and thank you for this useful code! I tried to reproduce the unsupervised BERT+SimCSE results, but failed. My environment setup is as follows:

  • pytorch=1.7.1 (I also tested 1.8.0)
  • cudatoolkit=10.1
  • Single RTX 2080 Ti

The following script is the training script I used (exactly the same as run_unsup_example.sh).

python train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir result/my-unsup-simcse-bert-base-uncased \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --mlp_only_train \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"

Then, I obtained following evaluation results:

$ python evaluation.py --model_name_or_path result/my-unsup-simcse-bert-base-uncased/ --pooler cls_before_pooler --task_set sts --mode test
(some log ...)
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 65.14 | 79.35 | 70.48 | 80.72 | 76.45 |    74.21     |      70.97      | 73.90 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

I also evaluated the pretrained model (these results are similar to those reported in #25):

$ python evaluation.py --model_name_or_path princeton-nlp/unsup-simcse-bert-base-uncased --pooler cls_before_pooler --task_set sts --mode test
(some log ...)
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 68.40 | 82.41 | 74.38 | 80.91 | 78.56 |    76.85     |      72.23      | 76.25 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

I think the gap (2.35 on average) is too large. Is there something wrong in the above training/evaluation scripts? How can I obtain ~76 results on the STS tasks?

Confusion in run_unsup_examples.sh

Hi,
I have another question: when running run_unsup_example.sh with '--do_train True', it still goes directly to evaluation.
Thanks a lot!
