bm-k / kosimcse-skt

Simple Contrastive Learning of Korean Sentence Embeddings

Python 41.88% C++ 37.76% Cuda 19.98% C 0.23% Shell 0.16%

natural-language-processing sentence-embeddings sentence-similarity korean-simcse

kosimcse-skt's Introduction

KoSimCSE

An implementation of Simple Contrastive Learning of Korean Sentence Embeddings, based on:
- SimCSE (EMNLP 2021)
- Official implementation of SimCSE (GitHub)

Installation

git clone https://github.com/BM-K/KoSimCSE.git
cd KoSimCSE
git clone https://github.com/SKTBrain/KoBERT.git
cd KoBERT
pip install -r requirements.txt
pip install .
cd ..
pip install -r requirements.txt

Training - only supervised

Model
- SKT KoBERT
Dataset
- kakaobrain NLU dataset
  - train: KorNLI
  - dev & test: KorSTS
Setting
- epochs: 3
- dropout: 0.1
- batch size: 256
- temperature: 0.05
- learning rate: 1e-4
- warm-up ratio: 0.05
- max sequence length: 50
- evaluation steps during training: 250
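
To make the role of the temperature setting concrete, here is a minimal NumPy sketch of the supervised contrastive objective described in the SimCSE paper, with in-batch positives and hard negatives. It is an illustration only: the function names are hypothetical, and whether this repository's training code computes the loss in exactly this way is an assumption.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def supervised_simcse_loss(anchor, positive, negative, temperature=0.05):
    """Hypothetical helper; inputs are arrays of shape (batch, dim)."""
    a, p, n = (l2_normalize(np.asarray(v, dtype=float))
               for v in (anchor, positive, negative))
    # Each anchor is scored against every positive and every negative in the batch.
    logits = np.concatenate([a @ p.T, a @ n.T], axis=1) / temperature
    # Numerically stable log-softmax over the 2 * batch candidates.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct candidate for anchor i is its own positive, in column i.
    idx = np.arange(len(a))
    return float(-log_prob[idx, idx].mean())


# Toy batch: random vectors stand in for sentence embeddings.
rng = np.random.default_rng(0)
anchor = rng.normal(size=(4, 16))
positive = anchor + 0.1 * rng.normal(size=(4, 16))
negative = rng.normal(size=(4, 16))
print(supervised_simcse_loss(anchor, positive, negative))
```

A smaller temperature sharpens the softmax, so the loss penalizes hard negatives that sit close to the anchor more strongly.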
Run train -> test -> semantic_search

bash run_example.sh
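
The warm-up ratio in the settings above is commonly read as the fraction of total optimizer steps spent ramping the learning rate up from zero. The sketch below assumes linear warm-up followed by linear decay; the decay shape and the helper name `lr_at_step` are assumptions for illustration, not taken from this repository.

```python
def lr_at_step(step, total_steps, base_lr=1e-4, warmup_ratio=0.05):
    # Number of steps spent warming up, e.g. 5% of all optimizer steps.
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at the final step (assumed shape).
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# With 1000 total steps, warm-up covers the first 50 steps.
for step in (0, 25, 50, 500, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```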

Pre-Trained Models

The released checkpoint uses BERT's pooled [CLS] token representation
- Using the raw [CLS] token representation, without the pooling layer, may work better
Pre-Trained model check point
- Google Drive Sharing
- ./output/nli_checkpoint.pt
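
The difference between the pooled and the raw [CLS] representation mentioned above can be sketched with toy arrays. In BERT, the pooler applies a dense layer and a tanh to the first token's final hidden state. The sizes and random weights below are stand-ins for illustration, not values from the checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 8  # toy size chosen for the example

# Stand-in for the encoder output: (batch, seq_len, hidden_size).
last_hidden_state = rng.normal(size=(2, 5, hidden_size))

# Stand-in pooler parameters (random here, learned in a real model).
W = 0.1 * rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

# Raw [CLS]: the final hidden state of the first token.
cls_raw = last_hidden_state[:, 0]

# Pooled [CLS]: dense layer followed by tanh, applied to the raw [CLS] vector.
cls_pooled = np.tanh(cls_raw @ W.T + b)

print(cls_raw.shape, cls_pooled.shape)
```

Because of the tanh, every component of the pooled vector lies in [-1, 1], while the raw [CLS] vector is unbounded.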

Performance

Model	AVG	Cosine Pearson	Cosine Spearman	Euclidean Pearson	Euclidean Spearman	Manhattan Pearson	Manhattan Spearman	Dot Pearson	Dot Spearman
KoSBERT^†_SKT	77.40	78.81	78.47	77.68	77.78	77.71	77.83	75.75	75.22
KoSBERT	80.39	82.13	82.25	80.67	80.75	80.69	80.78	77.96	77.90
KoSRoBERTa	81.64	81.20	82.20	81.79	82.34	81.59	82.20	80.62	81.25

KoSentenceBART	77.14	79.71	78.74	78.42	78.02	78.40	78.00	74.24	72.15
KoSentenceT5	77.83	80.87	79.74	80.24	79.36	80.19	79.27	72.81	70.17

KoSimCSE-BERT^†_SKT	81.32	82.12	82.56	81.84	81.63	81.99	81.74	79.55	79.19
KoSimCSE-BERT	83.37	83.22	83.58	83.24	83.60	83.15	83.54	83.13	83.49
KoSimCSE-RoBERTa	83.65	83.60	83.77	83.54	83.76	83.55	83.77	83.55	83.64

KoSimCSE-BERT-multitask	85.71	85.29	86.02	85.63	86.01	85.57	85.97	85.26	85.93
KoSimCSE-RoBERTa-multitask	85.77	85.08	86.12	85.84	86.12	85.83	86.12	85.03	85.99

†: KoSBERT^†_SKT
Performance comparison with other models [KLUE-PLMs].
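
As a rough guide to reading the table, the sketch below computes Pearson and Spearman correlations between predicted similarities and gold scores, then checks that the AVG column matches the mean of the eight metric columns for one row. The toy predictions and gold scores are made-up numbers, the helper names are hypothetical, and the simple rank function assumes no ties.

```python
import numpy as np


def pearson(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


def ordinal_ranks(x):
    # Rank of each element (0 = smallest); valid only when there are no ties.
    order = np.argsort(x)
    ranks = np.empty(len(order))
    ranks[order] = np.arange(len(order))
    return ranks


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ordinal_ranks(x), ordinal_ranks(y))


# Made-up cosine similarities and gold similarity scores.
pred = [0.91, 0.15, 0.62, 0.40]
gold = [4.8, 0.5, 3.9, 2.0]
print(pearson(pred, gold), spearman(pred, gold))

# The eight metric columns of the KoSimCSE-BERT row in the table above.
row = [83.22, 83.58, 83.24, 83.60, 83.15, 83.54, 83.13, 83.49]
print(round(float(np.mean(row)), 2))  # → 83.37, the AVG reported for that row
```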

Example Downstream Task

Semantic Search

python SemanticSearch.py

import numpy as np
from model.utils import pytorch_cos_sim
from data.dataloader import convert_to_tensor, example_model_setting


def main():
    model_ckpt = './output/nli_checkpoint.pt'
    model, transform, device = example_model_setting(model_ckpt)

    # Corpus with example sentences
    corpus = ['한 남자가 음식을 먹는다.',
              '한 남자가 빵 한 조각을 먹는다.',
              '그 여자가 아이를 돌본다.',
              '한 남자가 말을 탄다.',
              '한 여자가 바이올린을 연주한다.',
              '두 남자가 수레를 숲 속으로 밀었다.',
              '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
              '원숭이 한 마리가 드럼을 연주한다.',
              '치타 한 마리가 먹이 뒤에서 달리고 있다.']

    inputs_corpus = convert_to_tensor(corpus, transform)

    corpus_embeddings = model.encode(inputs_corpus, device)

    # Query sentences:
    queries = ['한 남자가 파스타를 먹는다.',
               '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
               '치타가 들판을 가로 질러 먹이를 쫓는다.']

    # Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
    top_k = 5
    for query in queries:
        query_embedding = model.encode(convert_to_tensor([query], transform), device)
        cos_scores = pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
        cos_scores = cos_scores.cpu().detach().numpy()

        top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

        print("\n\n======================\n\n")
        print("Query:", query)
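        print("\nTop %d most similar sentences in corpus:" % top_k)

        # top_results already holds exactly top_k indices, best match first
        for idx in top_results:
            print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))


if __name__ == '__main__':
    main()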
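
The top-k selection in the script above passes a sequence of positions to np.argpartition, which places the k best scores in sorted order at the front without sorting the rest. A small standalone sketch with made-up scores shows that it agrees with a full sort when there are no ties:

```python
import numpy as np

# Made-up similarity scores for a nine-sentence corpus.
scores = np.array([0.61, 0.49, 0.00, 0.08, 0.004, -0.02, 0.018, -0.10, -0.05])
top_k = 5

# Negate so that the largest scores come first; positions 0..top_k-1 are
# each guaranteed to hold the element that a full sort would put there.
top = np.argpartition(-scores, range(top_k))[:top_k]

# Reference: indices from a full descending sort.
full = np.argsort(-scores)[:top_k]

print(top, full)
```

Note that top_k must not exceed the corpus size, since every requested position has to be a valid index.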

Result

Query: 한 남자가 파스타를 먹는다.

Top 5 most similar sentences in corpus:
한 남자가 음식을 먹는다. (Score: 0.6146)
한 남자가 빵 한 조각을 먹는다. (Score: 0.4922)
한 남자가 말을 탄다. (Score: 0.0797)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.0183)
한 여자가 바이올린을 연주한다. (Score: 0.0041)


======================


Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.

Top 5 most similar sentences in corpus:
원숭이 한 마리가 드럼을 연주한다. (Score: 0.5087)
한 여자가 바이올린을 연주한다. (Score: 0.4180)
한 남자가 말을 탄다. (Score: 0.3403)
그 여자가 아이를 돌본다. (Score: 0.2689)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.1671)


======================


Query: 치타가 들판을 가로 질러 먹이를 쫓는다.

Top 5 most similar sentences in corpus:
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.8106)
한 남자가 말을 탄다. (Score: 0.1910)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.1614)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.1557)
원숭이 한 마리가 드럼을 연주한다. (Score: 0.1269)

Citing

@article{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   journal={arXiv preprint arXiv:2104.08821},
   year={2021}
}
@article{ham2020kornli,
 title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
 author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
 journal={arXiv preprint arXiv:2004.03289},
 year={2020}
}

kosimcse-skt's Issues

[IndexError] tuple index out of range

Environment: Colab Pro

!git clone https://github.com/BM-K/KoSimCSE.git
%cd KoSimCSE
!git clone https://github.com/SKTBrain/KoBERT.git
%cd KoBERT
!pip install -r requirements.txt
!pip install .
%cd ..
!pip install -r requirements.txt

!pip install transformers==4.8.1
!pip install folium==0.2.1
!pip install tensorboardX

I pinned these package versions to work around compatibility problems.

!chmod +x /content/KoSimCSE/run_example.sh
!/content/KoSimCSE/run_example.sh

This is the console output from running the code above.

Start Training
argparse{
 	 opt_level : O1
	 fp16 : True
	 train : True
	 test : False
	 device : cuda
	 patient : 10
	 dropout : 0.1
	 max_len : 50
	 batch_size : 256
	 epochs : 3
	 eval_steps : 250
	 seed : 1234
	 lr : 0.0001
	 weight_decay : 0.0
	 warmup_ratio : 0.05
	 temperature : 0.05
	 train_data : train_nli_sample.tsv
	 valid_data : valid_sts_sample.tsv
	 test_data : test_sts.tsv
	 task : NLU
	 path_to_data : ./data/
	 path_to_save : ./output/
	 path_to_saved_model : ./output/
	 ckpt : best_checkpoint.pt 
}
using cached model
using cached model
using cached model
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
[INFO] 2021-12-23 05:32:45,674 [ Model Setting Complete ] | file::main.py | line::8
[INFO] 2021-12-23 05:32:45,674 [ Start Training ] | file::main.py | line::11
  0% 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 28, in <module>
    main(args, logger)
  File "main.py", line 15, in main
    processor.train(epoch+1)
  File "/content/KoSimCSE/model/simcse/processor.py", line 118, in train
    train_loss = self.run(inputs, type='train')
  File "/content/KoSimCSE/model/simcse/processor.py", line 36, in run
    anchor_embeddings, positive_embeddings, negative_embeddings = self.config['model'](inputs, type)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/KoSimCSE/model/simcse/bert.py", line 28, in forward
    attention_mask=positive_attention_mask)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 1001, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 589, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 475, in forward
    past_key_value=self_attn_past_key_value,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 408, in forward
    output_attentions,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 267, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/content/KoSimCSE/apex/amp/wrap.py", line 21, in wrapper
    args[i] = utils.cached_cast(cast_fn, args[i], handle.cache)
  File "/content/KoSimCSE/apex/amp/utils.py", line 97, in cached_cast
    if cached_x.grad_fn.next_functions[1][0].variable is not x:
**IndexError: tuple index out of range**
Start Testing
argparse{
 	 opt_level : O1
	 fp16 : True
	 train : False
	 test : True
	 device : cuda
	 patient : 10
	 dropout : 0.1
	 max_len : 50
	 batch_size : 256
	 epochs : 3
	 eval_steps : 250
	 seed : 1234
	 lr : 5e-05
	 weight_decay : 0.0
	 warmup_ratio : 0.05
	 temperature : 0.05
	 train_data : train_nli.tsv
	 valid_data : valid_sts.tsv
	 test_data : test_sts_sample.tsv
	 task : NLU
	 path_to_data : ./data/
	 path_to_save : ./output/
	 path_to_saved_model : ./output/best_checkpoint.pt
	 ckpt : best_checkpoint.pt 
}
using cached model
using cached model
using cached model
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
[INFO] 2021-12-23 05:33:01,197 [ Model Setting Complete ] | file::main.py | line::8
[INFO] 2021-12-23 05:33:01,197 [ Start Test ] | file::main.py | line::18
Traceback (most recent call last):
  File "main.py", line 28, in <module>
    main(args, logger)
  File "main.py", line 20, in main
    processor.test()
  File "/content/KoSimCSE/model/simcse/processor.py", line 163, in test
    self.config['model'].load_state_dict(torch.load(self.args.path_to_saved_model)['model'], strict=False)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './output/best_checkpoint.pt'
Semantic Search
using cached model
using cached model
using cached model
/content/KoSimCSE/data/dataloader.py:178: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:201.)
  inputs = {'source': torch.LongTensor(tensor_corpus),


======================


Query: 한 남자가 파스타를 먹는다.

Top 5 most similar sentences in corpus:
한 남자가 음식을 먹는다. (Score: 0.6002)
한 남자가 빵 한 조각을 먹는다. (Score: 0.5940)
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.0694)
한 남자가 말을 탄다. (Score: 0.0327)
원숭이 한 마리가 드럼을 연주한다. (Score: -0.0050)


======================


Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.

Top 5 most similar sentences in corpus:
원숭이 한 마리가 드럼을 연주한다. (Score: 0.6490)
한 여자가 바이올린을 연주한다. (Score: 0.3669)
한 남자가 말을 탄다. (Score: 0.2322)
그 여자가 아이를 돌본다. (Score: 0.1980)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.1627)


======================


Query: 치타가 들판을 가로 질러 먹이를 쫓는다.

Top 5 most similar sentences in corpus:
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.7756)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.1812)
한 남자가 말을 탄다. (Score: 0.1667)
원숭이 한 마리가 드럼을 연주한다. (Score: 0.1530)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.1269)

I confirmed that the error occurs in the Processor's train function, at this line:
anchor_embeddings, positive_embeddings, negative_embeddings = self.config['model'](inputs, type)
What I am not sure about is whether the problem originates when the config is created.
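
One reading of the log above is that the later FileNotFoundError is a consequence of the training crash: the test stage tries to load ./output/best_checkpoint.pt, which was never written. A minimal sketch of a guard that makes this explicit follows. `checkpoint_ready` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def checkpoint_ready(path):
    # True only if the checkpoint file exists and is non-empty.
    ckpt = Path(path)
    return ckpt.is_file() and ckpt.stat().st_size > 0


ckpt_path = "./output/best_checkpoint.pt"  # path taken from the log above
if not checkpoint_ready(ckpt_path):
    print(f"No checkpoint at {ckpt_path}; training probably did not finish.")
```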

SemanticSearch.py error

Hello,
When running SemanticSearch.py, the URL used to download the model and vocab inside get_pytorch_kobert_model() is blocked, which causes the following error.

Traceback (most recent call last):
  File "/home/motive/PycharmProjects/KoSimCSE_SKT/KoBERT/kobert/utils.py", line 46, in download
    response = requests.get(url, stream=True)
  File "/home/motive/anaconda3/envs/KoSimCSE_SKT/lib/python3.8/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/home/motive/anaconda3/envs/KoSimCSE_SKT/lib/python3.8/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/motive/anaconda3/envs/KoSimCSE_SKT/lib/python3.8/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/motive/anaconda3/envs/KoSimCSE_SKT/lib/python3.8/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/home/motive/anaconda3/envs/KoSimCSE_SKT/lib/python3.8/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='kobert.blob.core.windows.net', port=443): Max retries exceeded with url: /models/kobert/pytorch/kobert_v1.zip (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2ca9d28a60>: Failed to establish a new connection: [Errno -2] Name or service not known'))

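If you have the model and vocab files downloaded in advance, could you share them, or is there another workaround?

I found a similar error at the links below and tried loading the pre-packaged model and vocab files, but that led to further package dependency problems and errors, which is why I am asking here.
(Repositories consulted: https://github.com/SKTBrain/KoBERT, https://github.com/SKTBrain/KoBERT/tree/master/kobert_hf)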
