
kobart's Introduction

🤣 KoBART

BART (Bidirectional and Auto-Regressive Transformers) is trained as an autoencoder: noise is added to part of the input text, and the model learns to reconstruct the original. Korean BART (hereafter KoBART) is a Korean encoder-decoder language model trained on more than 40GB of Korean text using the Text Infilling noise function from the paper. We release the resulting KoBART-base.

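As a rough illustration of the Text Infilling objective, here is a minimal sketch, not KoBART's actual preprocessing code: the Poisson(λ=3) span lengths and single-mask-per-span replacement follow the BART paper, while the mask symbol and token list are placeholder assumptions.

import numpy as np

def text_infilling(tokens, mask="<mask>", mask_ratio=0.3, lam=3.0, seed=0):
    """BART-style Text Infilling: replace sampled spans with a single mask token."""
    rng = np.random.default_rng(seed)
    tokens, budget = list(tokens), int(len(tokens) * mask_ratio)
    while budget > 0 and len(tokens) > 1:
        span = min(int(rng.poisson(lam)), len(tokens) - 1)  # span length ~ Poisson(3)
        start = int(rng.integers(0, len(tokens) - span))
        tokens[start:start + span] = [mask]                  # one mask per span,
        budget -= max(span, 1)                               # even for length-0 spans
    return tokens

print(text_infilling("나는 오늘 아침 밥을 먹고 학교에 갔다".split()))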

How to install

pip install git+https://github.com/SKT-AI/KoBART#egg=kobart

Data

Data           # of Sentences
Korean Wiki    5M
Other corpus   0.27B

In addition to Korean Wikipedia, a variety of data such as news, books, Modu Corpus v1.0 (dialogue, news, ...), and Blue House National Petition texts were used to train the model.

Tokenizer

The tokenizer was trained with the Character BPE tokenizer from the tokenizers package.

The vocab size is 30,000, and emoticons and emojis frequently used in dialogue, such as the ones below, were added to improve the model's handling of those tokens.

😀, 😁, 😆, 😅, 🤣, .. , :-), :), -), (-:...

In addition, unused tokens such as <unused0> ~ <unused99> are defined, so they can be freely redefined and used for whatever subtasks require them (see the short example after the tokenizer snippet below).

>>> from kobart import get_kobart_tokenizer
>>> kobart_tokenizer = get_kobart_tokenizer()
>>> kobart_tokenizer.tokenize("안녕하세요. 한국어 BART 입니다.🤣:)l^o")
['▁안녕하', '세요.', '▁한국어', '▁B', 'A', 'R', 'T', '▁입', '니다.', '🤣', ':)', 'l^o']
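As a small, hypothetical illustration of those unused tokens (assuming <unused0> is in the vocab as described above), one could reserve one of them as a task-specific marker:

>>> # hypothetical: reserve <unused0> as a custom separator for a downstream task
>>> sep_id = kobart_tokenizer.convert_tokens_to_ids('<unused0>')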

Model

Model         # of params   Type      # of layers   # of heads   ffn_dim   hidden_dims
KoBART-base   124M          Encoder   6             16           3072      768
                            Decoder   6             16           3072      768
>>> from transformers import BartModel
>>> from kobart import get_pytorch_kobart_model, get_kobart_tokenizer
>>> kobart_tokenizer = get_kobart_tokenizer()
>>> model = BartModel.from_pretrained(get_pytorch_kobart_model())
>>> inputs = kobart_tokenizer(['안녕하세요.'], return_tensors='pt')
>>> model(inputs['input_ids'])
Seq2SeqModelOutput(last_hidden_state=tensor([[[-0.4418, -4.3673,  3.2404,  ...,  5.8832,  4.0629,  3.5540],
         [-0.1316, -4.6446,  2.5955,  ...,  6.0093,  2.7467,  3.0007]]],
       grad_fn=<NativeLayerNormBackward>), past_key_values=((tensor([[[[-9.7980e-02, -6.6584e-01, -1.8089e+00,  ...,  9.6023e-01, -1.8818e-01, -1.3252e+00],

Performances

Classification or Regression

              NSMC (acc)   KorSTS (spearman)   Question Pair (acc)
KoBART-base   90.24        81.66               94.34

Summarization

  • To be updated

Demos

The example above is the result of summarizing a ZDNET article.

Examples

If you have an interesting example that uses KoBART, please send a PR!

Release

  • v0.5.1
    • guide default 'import statements'
  • v0.5
    • download large files from aws s3
  • v0.4
    • Update model binary
  • v0.3
    • Fixed an issue where the <unk> token was dropped due to a tokenizer bug
  • v0.2
    • Updated the KoBART model (better sample efficiency on subtasks)
    • Specified the version of Modu Corpus used
    • Fixed a downloader bug
    • Added pip install support

Contacts

Please post KoBART-related issues here.

License

KoBART is released under a modified MIT license. Please comply with the license when using the model or code. The full license text is available in the LICENSE file.

kobart's People

Contributors

bage79, haven-jeon, seoneun, seujung


kobart's Issues

Question about the input format during pretraining

Hello! First of all, thank you for releasing KoBART! Thanks to it, my project has been much easier to carry out!

My friends and I are trying to develop a PEGASUS-style model using KoBART, but we couldn't find the exact inputs used when pretraining KoBART, so I'm asking here.

Someone already raised a similar issue in #2, but due to my limited knowledge I couldn't quite understand it, so I'm asking again.

As I understand it, several sentences are fed to the encoder and decoder like below. Is that right?

  • raw text: 이렇게 문장을 입력하면 되나요? 맞는지 궁금합니다.
  • tokenized: (image)

Which method is used on the demo page?

Which method should I use to get a summarized synopsis when I feed in an article, as on the demo page?

My virtual environment is set up and the package is installed.

from transformers import BartModel
from kobart import get_pytorch_kobart_model, get_kobart_tokenizer

if __name__ == '__main__':
    kobart_tokenizer = get_kobart_tokenizer()
    kobart_tokenizer.tokenize("안녕하세요. 한국어 BART 입니다.🤣:)l^o")

    model = BartModel.from_pretrained(get_pytorch_kobart_model())
    inputs = kobart_tokenizer(['여권에서 서울시장 보선 출마를 공식화한 것은 우 의원이 처음이다. 차기 총선에 불출마하고 모든 것을 걸겠다며 배수진을 쳤다. 우 의원은 13일 국회 소통관에서 출마 기자회견을 하고 "이번 선거는 대단히 중요한 선거"라며 "문재인 대통령이 성공한 대통령으로 평가받느냐, 야당의 흠집내기, 발목잡기로 혼란스러운 국정 후반기를 보내야 하느냐를 결정하는 선거"라고 말했다.'], return_tensors='pt')
    model(inputs['input_ids'])
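For what it's worth, BartModel only returns hidden states; summaries like the demo's come from sequence generation with a summarization fine-tuned checkpoint. A minimal, hedged sketch of that pattern (this loads the pretrained base via get_pytorch_kobart_model(), so it will not produce a real summary until fine-tuned weights are swapped in, and the input string is a placeholder):

from transformers import BartForConditionalGeneration
from kobart import get_pytorch_kobart_model, get_kobart_tokenizer

tokenizer = get_kobart_tokenizer()
# NOTE: this is the *pretrained* base; the demo presumably loads weights
# fine-tuned for summarization (e.g. from the KoBART-summarization project).
model = BartForConditionalGeneration.from_pretrained(get_pytorch_kobart_model())

inputs = tokenizer(['요약할 기사 본문 ...'], return_tensors='pt')
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))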

Error during KoBART summarization fine-tuning

Hello, thank you for releasing a good model.

To fine-tune KoBART summarization after installing it, I ran the command below as described in the README:

[use cpu]
python train.py  --gradient_clip_val 1.0 --max_epochs 50 --default_root_dir logs  --batch_size 4 --num_workers 4

However, the following error occurred during the validation sanity check:

INFO:root:Namespace(accelerator=None, accumulate_grad_batches=1, amp_backend='native', amp_level='O2', auto_lr_find=False, auto_scale_batch_size=False, auto_select_gpus=False, batch_size=4, benchmark=False, check_val_every_n_epoch=1, checkpoint_callback=True, checkpoint_path=None, default_root_dir='logs', deterministic=False, distributed_backend=None, fast_dev_run=False, flush_logs_every_n_steps=100, gpus=None, gradient_clip_algorithm='norm', gradient_clip_val=1.0, limit_predict_batches=1.0, limit_test_batches=1.0, limit_train_batches=1.0, limit_val_batches=1.0, log_every_n_steps=50, log_gpu_memory=None, logger=True, lr=3e-05, max_epochs=50, max_len=512, max_steps=None, max_time=None, min_epochs=None, min_steps=None, model_path=None, move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', num_nodes=1, num_processes=1, num_sanity_val_steps=2, num_workers=4, overfit_batches=0.0, plugins=None, precision=32, prepare_data_per_node=True, process_position=0, profiler=None, progress_bar_refresh_rate=None, reload_dataloaders_every_epoch=False, replace_sampler_ddp=True, resume_from_checkpoint=None, stochastic_weight_avg=False, sync_batchnorm=False, terminate_on_nan=False, test_file='data/test.tsv', tpu_cores=None, track_grad_norm=-1, train_file='data/train.tsv', truncated_bptt_steps=None, val_check_interval=1.0, warmup_ratio=0.1, weights_save_path=None, weights_summary='top')
using cached model
using cached model
using cached model
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
INFO:root:number of workers 4, data length 34242
INFO:root:num_train_steps : 107006
INFO:root:num_warmup_steps : 10700
2021-11-05 10:27:55.060417: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2021-11-05 10:27:55.069132: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

  | Name  | Type                         | Params
-------------------------------------------------------
0 | model | BartForConditionalGeneration | 123 M
-------------------------------------------------------
123 M     Trainable params
0         Non-trainable params
123 M     Total params
495.440   Total estimated model params size (MB)
Validation sanity check:   0%|                                                                   | 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 233, in <module>
    trainer.fit(model, dm)
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pytorch_lightning\trainer\trainer.py", line 460, in fit
    self._run(model)
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pytorch_lightning\trainer\trainer.py", line 758, in _run
    self.dispatch()
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pytorch_lightning\trainer\trainer.py", line 799, in dispatch
    self.accelerator.start_training(self)
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pytorch_lightning\accelerators\accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pytorch_lightning\trainer\trainer.py", line 809, in run_stage
    return self.run_train()
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pytorch_lightning\trainer\trainer.py", line 844, in run_train
    self.run_sanity_check(self.lightning_module)
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pytorch_lightning\trainer\trainer.py", line 1112, in run_sanity_check
    self.run_evaluation()
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pytorch_lightning\trainer\trainer.py", line 967, in run_evaluation
    output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pytorch_lightning\trainer\evaluation_loop.py", line 174, in evaluation_step
    output = self.trainer.accelerator.validation_step(args)
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pytorch_lightning\accelerators\accelerator.py", line 226, in validation_step
    return self.training_type_plugin.validation_step(*args)
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 161, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File "train.py", line 195, in validation_step
    outs = self(batch)
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "train.py", line 185, in forward
    labels=inputs['labels'], return_dict=True)
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\transformers\models\bart\modeling_bart.py", line 1295, in forward
    return_dict=return_dict,
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\transformers\models\bart\modeling_bart.py", line 1157, in forward
    return_dict=return_dict,
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\transformers\models\bart\modeling_bart.py", line 748, in forward
    inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\torch\nn\modules\sparse.py", line 126, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "C:\Users\Newrun\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\torch\nn\functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.IntTensor instead (while checking arguments for embedding)

What should I do to make this run normally?
Thank you.

Multi-GPU

Hello! 😄

์ข‹์€ ์ฝ”๋“œ ์ •๋ง ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค ๋•๋ถ„์— ํ•œ๊ตญ์–ด large-scale model์— ๋Œ€ํ•ด ์—ด์‹ฌํžˆ ๊ณต๋ถ€ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹คใ…Žใ…Ž

Could I ask how to set up multi-GPU training in nsmc.py, the NSMC benchmark classification task code?

Simply changing the --gpus number raises an error:

(AttributeError: Can't pickle local object 'get_cosine_schedule_with_warmup..lr_lambda')

I'd like to fix it the way it's usually done in PyTorch, but even stepping through with a debugger I'm having trouble, so I'm leaving this issue.

๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค!

Multi class classification

Hello!!

First of all, thank you for providing the great code :)

ValueError: The highest label in `target` should be smaller than the size of the `C` dimension of `preds`.

I'm trying to classify 7 classes with the nsmc.py code, and the error above occurs in the validation step. I'd like to know how to change the model's output dimension for multi-class classification.

๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค.

Question about the pre-training objectives

Hello.
The introduction page only describes Text Infilling, so I'm asking to confirm:

  • As in the BART paper, was the model trained with the combination of
    Text Infilling + Sentence Shuffling as the pre-training objective?

  • Also, could you share a bit more detail about the data corresponding to the
    Other corpus (0.27B), such as its composition and ratios?
    I'm curious which Modu Corpus datasets you used besides dialogue and news, and roughly what proportion each makes up.

์ข‹์€ ๋ชจ๋ธ ๊ณต๊ฐœํ•ด์ฃผ์…”์„œ ์ •๋ง ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค.

BartForSequenceClassification does not work correctly

๐Ÿ› Bug

BartForSequenceClassification does not work properly. I only slightly changed the model in the tutorial, and it raises an error.
The original (English) BART works fine, but KoBART does not. It seems related to the eos token; please take a look.

To Reproduce

import torch
from transformers import BartTokenizer, BartForSequenceClassification
from kobart import get_pytorch_kobart_model, get_kobart_tokenizer
from transformers.models.bart.modeling_bart import *

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForSequenceClassification.from_pretrained("facebook/bart-base")

inputs = tokenizer(["Hello, my dog is cute"], return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
outputs = model(inputs['input_ids'], labels=labels)
print(outputs.logits)

kobart_tokenizer = get_kobart_tokenizer()
ko_model = BartForSequenceClassification.from_pretrained(get_pytorch_kobart_model())

inputs = kobart_tokenizer(['안녕하세요.'], return_tensors='pt')
labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
ko_model(inputs['input_ids'], labels=labels)

Steps to reproduce:

  1. Install BART and KoBART, then run the code above

Expected behavior

The outputs should contain proper values, but they don't.

Environment

Tested on both Colab and a local machine.
Local environment:
Package Version


absl-py 1.0.0
cachetools 5.0.0
certifi 2021.10.8
charset-normalizer 2.0.11
click 8.0.3
colorama 0.4.4
cycler 0.11.0
filelock 3.4.2
fonttools 4.29.1
google-auth 2.6.0
google-auth-oauthlib 0.4.6
grpcio 1.43.0
huggingface-hub 0.4.0
idna 3.3
importlib-metadata 4.10.1
joblib 1.1.0
kiwisolver 1.3.2
Markdown 3.3.6
matplotlib 3.5.1
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
nltk 3.6.7
numpy 1.21.2
oauthlib 3.2.0
olefile 0.46
packaging 21.3
Pillow 8.4.0
pip 21.2.2
protobuf 3.19.4
pyasn1 0.4.8
pyasn1-modules 0.2.8
pyparsing 3.0.7
python-dateutil 2.8.2
PyYAML 6.0
regex 2022.1.18
requests 2.27.1
requests-oauthlib 1.3.1
rouge-score 0.0.4
rsa 4.8
sacremoses 0.0.47
setuptools 58.0.4
six 1.16.0
tensorboard 2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tokenizers 0.11.4
torch 1.10.2
torchaudio 0.10.2
torchvision 0.11.3
tqdm 4.62.3
transformers 4.16.2
typing-extensions 3.10.0.2
urllib3 1.26.8
Werkzeug 2.0.2
wheel 0.37.1
wincertstore 0.2
zipp 3.7.0

Additional context

Question about model training on a Korean QA dataset with the KoBART tokenizer

Hello.
First of all, thank you for sharing this great resource.

To implement a QA task with KoBART, I'm currently training on the KorQuAD 1.0 dataset using the KoBART tokenizer and an AutoModelForQuestionAnswering model from Transformers.
However, the following error occurs when saving the model during training, so I'm asking about it:

[INFO|configuration_utils.py:329] 2021-04-28 16:20:12,610 >> Configuration saved in models/checkpoint-45000/config.json44998/56471 [5:26:32<1:22:42, 2.31it/s]
[INFO|modeling_utils.py:848] 2021-04-28 16:20:13,520 >> Model weights saved in models/checkpoint-45000/pytorch_model.bin
[INFO|tokenization_utils_base.py:1918] 2021-04-28 16:20:13,521 >> tokenizer config file saved in models/checkpoint-45000/tokenizer_config.json
[INFO|tokenization_utils_base.py:1924] 2021-04-28 16:20:13,521 >> Special tokens file saved in models/checkpoint-45000/special_tokens_map.json
[WARNING|tokenization_gpt2.py:288] 2021-04-28 16:20:13,619 >> Saving vocabulary to models/checkpoint-45000/merges.txt: BPE merge indices are not consecutive. Please check that the tokenizer is not corrupted!
[WARNING|tokenization_gpt2.py:288] 2021-04-28 16:20:13,620 >> Saving vocabulary to models/checkpoint-45000/merges.txt: BPE merge indices are not consecutive. Please check that the tokenizer is not corrupted!
...

Looking at the code responsible for model saving, it extracts bpe_tokens when writing merges.txt, and at that point index and token_index end up with different values. Could you check whether this is a tokenization error or an error in the model itself?

Training was done with this code.
Thank you!

Question about the KoBART-translation demo

Hello!
I'm asking here because the documentation for https://github.com/seujung/KoBART-translation is sparse.
After training by following https://github.com/seujung/KoBART-translation, to run the demo I got as far as:

python get_model_binary.py --hparams ./logs/tb_logs/default/version_0/hparams.yaml --model_binary ./logs/kobart_translation-model_chp/epoch=49-val_loss=9.253.ckpt
That is where I am now. I expected that command to output ./translation_binary, but it produced config.json and pytorch_model.bin instead.

Then, when I run streamlit run infer.py to open the demo page, it leads to the following error (screenshot omitted):
OSError: Can't load the configuration of './translation_binary'. If you were trying to load it
from 'https://huggingface.co/models', make sure you don't have a local directory with the same
name. Otherwise, make sure './translation_binary' is the correct path to a directory containing
a config.json file

Traceback:
File "/home/lib/python3.8/site-packages/streamlit/script_runner.py", line 332, in _run_script
    exec(code, module.__dict__)
File "/home/KoBART-translation/infer.py", line 12, in <module>
    model = load_model()
File "/home/lib/python3.8/site-packages/streamlit/caching.py", line 604, in wrapped_func
    return get_or_create_cached_value()
File "/home/lib/python3.8/site-packages/streamlit/caching.py", line 588, in get_or_create_cached_value
    return_value = func(*args, **kwargs)
File "/home/KoBART-translation/infer.py", line 8, in load_model
    model = BartForConditionalGeneration.from_pretrained('./translation_binary')
File "/home/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2175, in from_pretrained
    config, model_kwargs = cls.config_class.from_pretrained(
File "/home/lib/python3.8/site-packages/transformers/configuration_utils.py", line 546, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/lib/python3.8/site-packages/transformers/configuration_utils.py", line 573, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/lib/python3.8/site-packages/transformers/configuration_utils.py", line 649, in _get_config_dict
    raise EnvironmentError(


I can't figure out how to solve this from here, so I'm asking for help. Thank you.

Question about the input structure during pretraining!

Looking at the Hugging Face KoBART model, I see that when decoder_input_ids is not passed in, it is constructed as below:

labels: t1, t2, t3,..., tn, eos, pad, pad
decoder_input_ids: eos, t1, t2, t3, ...

Could you tell me whether you passed decoder_input_ids in explicitly during pretraining, or fed only input_ids and labels?
Additionally, could you clarify the exact structure of input_ids and labels at input time?

ex)

input_ids = bos, t1, ..., tn, eos, pad, ... (including masks)
labels = bos, t1, t2, ..., tn, eos, pad, ...
decoder_input_ids (if passed explicitly) = bos, t1, t2, ..., tn, eos

Lastly, thank you for sharing the model.
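For reference, a minimal sketch of what the Hugging Face BART implementation does when decoder_input_ids is omitted: it derives them from labels with shift_tokens_right. This shows the library mechanism, not SKT's actual pretraining pipeline, and the token ids below are illustrative only:

import torch
from transformers.models.bart.modeling_bart import shift_tokens_right

labels = torch.tensor([[101, 102, 103, 1, 3, 3]])  # t1..t3, eos=1, pad=3 (illustrative ids)
# decoder_start_token_id comes from the model config; eos is a common choice for BART.
decoder_input_ids = shift_tokens_right(labels, pad_token_id=3, decoder_start_token_id=1)
print(decoder_input_ids)  # tensor([[  1, 101, 102, 103,   1,   3]])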

GPUs used for pretraining/fine-tuning

Hello! Thank you for releasing a great model!

Could you share the specs of the GPUs you used for pretraining and fine-tuning?

Thank you :)

Problem with the domain of the weight-download server

Hello, thank you for releasing a great model and weights.
However, the domain serving the trained weights is down, so the tokenizer and weights currently cannot be downloaded.

(DNS error screenshot)

Could you take a look?

AttributeError: module 'torch' has no attribute 'QUInt4x2Storage'

Hello,
Thank you for sharing this great code.

This issue occurs while using the kobart translator.
While tracking down the cause, I found that it started occurring right after installing kobart.

Torch == 1.7.1
pytorch-lightning == 1.1.0
python == 3.7.0

>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/envs/test_env/lib/python3.7/site-packages/torch/__init__.py", line 500, in <module>
    _C._initExtension(manager_path())
AttributeError: module 'torch' has no attribute 'QUInt4x2Storage'

Can you tell what kind of error this is?
Thank you for sharing this great resource!

How do I use a model fine-tuned with the nsmc example?

I fine-tuned a model that classifies 8 classes using nsmc.py from the examples.
As training output I got a ckpt file and an hparams.yaml file.

For the summarization task I used example code from the KoBART summarization repo with my fine-tuned model, but there is no usage example for the nsmc side, so I don't know how to use it.

I'd appreciate it if you could share a usage example or any link worth referencing.

Could you share some code showing how to use it?

from kobart import get_kobart_tokenizer
kobart_tokenizer = get_kobart_tokenizer()
kobart_tokenizer.tokenize("안녕하세요. 한국어 BART 입니다.🤣:)l^o")
['▁안녕하', '세요.', '▁한국어', '▁B', 'A', 'R', 'T', '▁입', '니다.', '🤣', ':)', 'l^o']

๋„๋ฌด์ง€ ์ดํ•ด๊ฐ€ ์•ˆ๋˜๋„ค์š”
3์ค„๊นŒ์ง€ ์ž‘์„ฑํ•˜๊ณ 
4์ค„์— print(kobart_tokenizer) ๋ฅผ ํ•˜๋ฉด ๊ด„ํ˜ธ [] ๋ณด๊ธฐ์™€ ๊ฐ™์ด ๋‚˜์˜ค๋Š”๊ฒŒ ์•„๋‹Œ๊ฐ€์š”;;
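Back to the original question: not from the maintainers, but since nsmc.py is a PyTorch Lightning script, a minimal inference sketch would restore the LightningModule from the checkpoint. The class name KoBARTClassification, the forward signature, and the paths below are assumptions about nsmc.py, not verified API:

import torch
from nsmc import KoBARTClassification      # hypothetical: the LightningModule in nsmc.py
from kobart import get_kobart_tokenizer

model = KoBARTClassification.load_from_checkpoint('logs/model_chp/last.ckpt')  # path is illustrative
model.eval()

tokenizer = get_kobart_tokenizer()
inputs = tokenizer(['분류할 문장'], return_tensors='pt')
with torch.no_grad():
    logits = model(inputs['input_ids'])    # assumes forward() accepts input_ids
print(logits.argmax(-1))                   # predicted class index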

Dropping <unk> token

There is a bug that drops tokens, as shown below.

>>> from kobart import get_kobart_tokenizer
>>> kobart_tokenizer = get_kobart_tokenizer()
>>> kobart_tokenizer.tokenize("ab헣㉿cde")
['▁', 'ab', 'c', 'd', 'e']

[FEATURE] Release koBART-large

🚀 Feature

Release koBART-large

Motivation

The current release appears to be based on BART-base. As far as I know, BART has a large model in addition to base.
If a large-size model pretrained on Korean data could also be released, I think it would help improve performance.

Pitch

Implement a BART-large model, or load one from huggingface and then train it on a Korean dataset.

Additional context

Error on pip install: Command errored out with exit status 128

$ pip install git+https://github.com/SKT-AI/KoBART#egg=kobart --no-cache-dir -U turicreate

Attempting the install as above produces the output below (similar on any OS). Could you help?

WARNING: Discarding git+https://github.com/SKT-AI/KoBART#egg=kobart. Command errored out with exit status 128: git clone -q https://github.com/SKT-AI/KoBART /private/var/folders/7d/8j9vj0c541z7y9gs0_3hdsrr0000gn/T/pip-install-89_v3_4e/kobart_aa1496bd8fd04a869d9a9e1e3fbe5782 Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement kobart (unavailable)
ERROR: No matching distribution found for kobart (unavailable)

NotImplementedError raised in save_pretrained()

Hello. First of all, thank you for releasing this great model.

I'm filing this issue because an error occurred while using the model's tokenizer.

ํ™˜๊ฒฝ

  • Python 3.6.9
  • Transformers 4.3.3
  • torch 1.7.1+cu110
  • CUDA Version 11.2
  • KoBART 0.4 (installed via pip with git+https://github.com/SKT-AI/KoBART#egg=kobart)
  • The dataset is a Korean dataset we built in-house

Code where the error occurs

(Some irrelevant parts have been omitted.)

import os

import kobart
from transformers import AutoTokenizer

if not os.path.exists("./kobart_tokenizer"):
    os.makedirs("./kobart_tokenizer")
dummy_tokenizer = kobart.get_kobart_tokenizer()
dummy_tokenizer.save_pretrained("./kobart_tokenizer/")  # error raised here
tokenizer = AutoTokenizer.from_pretrained("./kobart_tokenizer")

Error

Traceback (most recent call last):
  File "misc.py", line 23, in <module>
    dummy_tokenizer.save_pretrained("./kobart_tokenizer/")
  File ".../venv/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1992, in save_pretrained
    filename_prefix=filename_prefix,
  File ".../venv/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py", line 535, in _save_pretrained
    vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
  File ".../venv/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 2044, in save_vocabulary
    raise NotImplementedError
NotImplementedError

Additional details

I discovered this error while trying to resolve a pickle error that occurs in code like the following:

tokenizer = kobart.get_kobart_tokenizer()
def preprocess(sample):
    global tokenizer
    return tokenizer(sample)

# omitted

def main():
    # omitted
    dataset = dataset.map(preprocess, batched=True)

Running this raises a pickling error on the line containing map. (The full error is very long, so I've omitted most of it and included only the last part.)

File ".../venv/lib/python3.6/pickle.py", line 927, in save_global
    (obj, module_name, name))
_pickle.PicklingError: Can't pickle <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>: it's not the same object as transformers.tokenization_utils_fast.PreTrainedTokenizerFast

One way to work around this was to call get_kobart_tokenizer() inside preprocess(sample), but reloading the tokenizer on every call turned out to be very slow and inefficient.
With other tokenizers provided by huggingface, declaring a single tokenizer up front and using it inside map as above causes no problems.
To work around it, I tried loading the tokenizer, calling save_pretrained(), and loading it back with AutoTokenizer.from_pretrained(), which is when the NotImplementedError above occurred.
The context may differ, but the 'not the same object as ...PreTrainedTokenizerFast' message made me suspect that part of the tokenizer is unimplemented, so I'm including both in this issue.

[FEATURE] migrate model, tokenizer, and dataset storage to `AWS S3`

🚀 Feature

Change the storage for the model, tokenizer, and dataset

Motivation

Change the storage to ensure stable management and download speeds.

Pitch

  • Upload the files currently stored on azure, dropbox, github, etc. to aws s3
  • Change the download function to use aws

Additional context
