
ru-gpts's Introduction

Russian GPT-3 models

ruGPT3XL, ruGPT3Large, ruGPT3Medium, ruGPT3Small and ruGPT2Large

This repository contains a collection of autoregressive transformer language models trained on a large dataset of the Russian language.

  • Russian GPT-3 models (ruGPT3XL, ruGPT3Large, ruGPT3Medium, ruGPT3Small) trained with a 2048 sequence length, with sparse and dense attention blocks. We also provide a Russian GPT-2 large model (ruGPT2Large) trained with a 1024 sequence length.

  • Try model generation in Colab: ruGPT-3 XL or the smaller ruGPT-3 models.

  • Usage examples are described in detail here. See how fine-tuning works in Colab.


ruGPT3XL

Setup

For Colab we recommend the following installation instructions:

export LD_LIBRARY_PATH=/usr/lib/
apt-get install clang-9 llvm-9 llvm-9-dev llvm-9-tools
# Build apex with its C++ and CUDA extensions.
git clone https://github.com/qywu/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
pip install triton
# Build DeepSpeed with the CPU Adam and sparse attention ops.
DS_BUILD_CPU_ADAM=1 DS_BUILD_SPARSE_ATTN=1 pip install deepspeed
pip install transformers
pip install huggingface_hub
pip install timm==0.3.2
git clone https://github.com/sberbank-ai/ru-gpts
# Patch transformers and apex with the fixed files shipped in this repo.
cp ru-gpts/src_utils/trainer_pt_utils.py /usr/local/lib/python3.8/dist-packages/transformers/trainer_pt_utils.py
cp ru-gpts/src_utils/_amp_state.py /usr/local/lib/python3.8/dist-packages/apex/amp/_amp_state.py
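
The cp commands above assume a Python 3.8 runtime. If your Colab uses a different Python version, you can locate the actual install paths first (a small helper, not part of the original instructions):

python -c "import transformers, os; print(os.path.dirname(transformers.__file__))"
python -c "import apex, os; print(os.path.dirname(apex.__file__))"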

After installing the environment, please restart Colab. To check that everything is OK, run the following commands:

!ds_report
# The output should include:
# ...
# sparse_attn ............ [YES] ...... [OKAY]
# ...

# This import should also succeed without errors:
import deepspeed.ops.sparse_attention.sparse_attn_op

Usage

Here is a simple usage example. For more, see this example or Open In Colab.

import os
import sys

# If running from the content root, make the repo importable first.
sys.path.append("ru-gpts/")

os.environ["USE_DEEPSPEED"] = "1"
# The master address and port can be changed if needed.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "5000"

from src.xl_wrapper import RuGPT3XL

gpt = RuGPT3XL.from_pretrained("sberbank-ai/rugpt3xl", seq_len=512)
gpt.generate(
    "Кто был президентом США в 2020? ",
    max_length=50,
    no_repeat_ngram_size=3,
    repetition_penalty=2.,
)
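
The wrapper also appears to accept the usual HuggingFace generation arguments, as the example above suggests. A minimal sketch of sampled, non-greedy generation, assuming generate forwards these arguments unchanged and returns a list of decoded strings:

# Sampled generation through the same wrapper (a sketch; the sampling
# arguments are assumed to be forwarded to the underlying generator).
results = gpt.generate(
    "Александр Сергеевич Пушкин родился в ",
    max_length=50,
    do_sample=True,
    top_k=5,
    top_p=0.95,
    temperature=1.2,
)
print(results[0])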

Finetuning

An example of fine-tuning, loading the fine-tuned model, and generating text is here.

Our fine-tuning script is here.
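
For orientation, here is a condensed sketch of the kind of invocation pretrain_gpt3.py accepts. The argument names are taken from the full argument dump that appears in the issues section below; the values are placeholders, not a verified recipe:

# A hedged sketch of a fine-tuning launch; adjust paths and values.
USE_DEEPSPEED=1 python pretrain_gpt3.py \
    --train-data-path train.list \
    --test-data-path valid.list \
    --tokenizer-path sberbank-ai/rugpt3xl \
    --load-huggingface sberbank-ai/rugpt3xl \
    --save model \
    --save-interval 1000 \
    --batch-size 1 \
    --seq-length 2048 \
    --fp16 \
    --deepspeed \
    --deepspeed_config src/deepspeed_config/gpt3_xl_sparse_2048.json \
    --finetune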

Pretraining details ruGPT3XL

The model was trained with a sequence length of 512, using DeepSpeed and Megatron code by the Devices team, on an 80B-token dataset for 4 epochs. After that, the model was fine-tuned for 1 epoch with a sequence length of 2048.
Note! The model has sparse attention blocks.

Total training time was around 10 days on 256 GPUs.
The final perplexity on the test set is 12.05.

🤗HuggingFace model card link.

ruGPT3Large, ruGPT3Medium, ruGPT3Small, ruGPT2Large

Setup

To use ruGPT3Large, ruGPT3Medium, ruGPT3Small, or ruGPT2Large, just install 🤗HuggingFace transformers:

pip install transformers==4.24.0

Usage

Here you can find examples of fine-tuning and generation.

These examples are also adapted for Google Colab:

  • finetuning: finetuning.
  • generation: generation
from transformers import GPT2LMHeadModel, GPT2Tokenizer


model_name_or_path = "sberbank-ai/rugpt3large_based_on_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name_or_path)
model = GPT2LMHeadModel.from_pretrained(model_name_or_path).cuda()
text = "Александр Сергеевич Пушкин родился в "
input_ids = tokenizer.encode(text, return_tensors="pt").cuda()
out = model.generate(input_ids)
generated_text = list(map(tokenizer.decode, out))[0]
print(generated_text)
# Output should be like this:
# Александр Сергеевич Пушкин родился в \n1799 году. Его отец был крепостным крестьянином, а мать – крепостной крестьянкой. Детство и юность Пушкина прошли в деревне Михайловское под Петербургом. В 1820-х годах семья переехала
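
By default generate() here is greedy; the standard HuggingFace sampling arguments (the same knobs used in the issue examples further below) can be passed for more varied output. A minimal sketch:

# Sampled generation with the same model and tokenizer.
out = model.generate(
    input_ids,
    max_length=50,
    do_sample=True,
    top_k=5,
    top_p=0.95,
    temperature=1.0,
    repetition_penalty=2.0,
    no_repeat_ngram_size=3,
)
print(tokenizer.decode(out[0]))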

Pretraining details

All pretraining was done on Nvidia Tesla V100-SXM3 32 GB GPUs on the Christofari cluster. The pretraining details for each model follow.

Pretraining details ruGPT3Large

The model was trained with a sequence length of 1024, using the transformers library, by the Devices team, on 80B tokens for 3 epochs. After that, the model was fine-tuned for 1 epoch with a sequence length of 2048.

Total training time was around 14 days on 128 GPUs for the 1024 context and a few days on 16 GPUs for the 2048 context.
The final perplexity on the test set is 13.6.

You can obtain this model by using transformers with model name sberbank-ai/rugpt3large_based_on_gpt2.

🤗HuggingFace model card link

Our pretraining script is here.

Pretraining details ruGPT3Medium

The model was trained with a sequence length of 1024, using the transformers library, by the Devices team, on 80B tokens for 3 epochs. After that, the model was fine-tuned on a 2048 context.

Total training time was around 16 days on 64 GPUs.
The final perplexity on the test set is 17.4.

You can obtain this model by using transformers with model name sberbank-ai/rugpt3medium_based_on_gpt2.

🤗HuggingFace model card link

Our pretraining script is here.

Pretraining details ruGPT3Small

The model was trained with a sequence length of 1024, using transformers, by the Devices team, on 80B tokens for around 3 epochs. After that, the model was fine-tuned on a 2048 context.

Total training time was around one week on 32 GPUs.

You can obtain this model by using transformers with model name sberbank-ai/rugpt3small_based_on_gpt2.

🤗HuggingFace model card link

Our pretraining script is here.

Pretraining details ruGPT2Large

The model was trained with a sequence length of 1024, using transformers, by the Devices team, on 170 GB of data, for 3 weeks on 64 GPUs.

You can obtain this model by using transformers with model name sberbank-ai/rugpt2large.

🤗HuggingFace model card link

OpenSource Solutions with ruGPT3

Papers mentioning ruGPT3

According to a Google Scholar search. Feel free to add links to this list.

Text Simplification

@article{shatilovsentence,
  title={Sentence simplification with ruGPT3},
  author={Shatilov, AA and Rey, AI},
  url={http://www.dialog-21.ru/media/5281/shatilovaaplusreyai142.pdf}
}

@article{fenogenovatext,
  title={Text Simplification with Autoregressive Models},
  author={Fenogenova, Alena},
  url={http://www.dialog-21.ru/media/5250/fenogenovaa141.pdf}
}

Text Detoxification

@article{dementieva2021methods,
  title={Methods for Detoxification of Texts for the Russian Language},
  author={Dementieva, Daryna and Moskovskiy, Daniil and Logacheva, Varvara and Dale, David and Kozlova, Olga and Semenov, Nikita and Panchenko, Alexander},
  journal={arXiv preprint arXiv:2105.09052},
  year={2021},
  url={https://arxiv.org/abs/2105.09052}
}

Paraphrasing and Data Augmentation

@inproceedings{fenogenova2021russian,
  title={Russian Paraphrasers: Paraphrase with Transformers},
  author={Fenogenova, Alena},
  booktitle={Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing},
  pages={11--19},
  year={2021},
  url={https://www.aclweb.org/anthology/2021.bsnlp-1.2.pdf}
}

Model Evaluation

@article{malykh2021morocco,
  title={MOROCCO: Model Resource Comparison Framework},
  author={Malykh, Valentin and Kukushkin, Alexander and Artemova, Ekaterina and Mikhailov, Vladislav and Tikhonova, Maria and Shavrina, Tatiana},
  journal={arXiv preprint arXiv:2104.14314},
  year={2021},
  url={https://arxiv.org/abs/2104.14314}
}

ru-gpts's People

Contributors

king-menin, ollmer, tatianashavrina

ru-gpts's Issues

Broken encoding of vocab.json

I was fine-tuning ruGPT3-Medium for QA, but there were some problems with training. After training for 50 epochs on a small dataset (to be sure that I can fine-tune the model), I found that the only answers were in English.
So I looked at what was in vocab.json and found lots of broken (?) symbols with strange encoding.
I tried changing the encoding to windows-1252, windows-1251, and iso8859-5, but there was no result. Can you please explain what I did wrong, or just fix vocab.json?

pytorch_model.bin is not saved after half of the iterations

I started fine-tuning on my own dataset and everything works, but pytorch_model.bin is not always saved in the checkpoints.
For example, the 10k-iteration checkpoint contains the model, while the 11k-iteration checkpoint is already missing it.

What could be the problem?

colab RuGPT3FinetuneHF.ipynb is not working

The fine-tuning notebook on Colab has stopped working: https://colab.research.google.com/github/sberbank-ai/ru-gpts/blob/master/examples/Finetune_RuGPTs_with_HF.ipynb
When executing the Train cell, the following error occurs:

2021-03-09 17:40:36.000492: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "ru-gpts/pretrain_transformers.py", line 33, in <module>
    from transformers import (
  File "/usr/local/lib/python3.7/dist-packages/transformers/__init__.py", line 626, in <module>
    from .trainer import Trainer
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 69, in <module>
    from .trainer_pt_utils import (
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer_pt_utils.py", line 40, in <module>
    from torch.optim.lr_scheduler import SAVE_STATE_WARNING
ImportError: cannot import name 'SAVE_STATE_WARNING' from 'torch.optim.lr_scheduler' (/usr/local/lib/python3.7/dist-packages/torch/optim/lr_scheduler.py)

Summarization

Please advise how ru-gpts can be used for summarization, or how to train it for that purpose. Thanks.
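
For context, one pattern that appears elsewhere in this document is format-based fine-tuning (see the NER issue further below). A hedged sketch of preparing summarization training samples in that spirit; the <s>...</s> markers follow the formats used in this repo's examples, while the text:/summary: field names are purely illustrative assumptions:

# Build <s>...</s>-delimited training samples (field names are assumptions).
def make_sample(text: str, summary: str) -> str:
    return f"<s>text: {text}\nsummary: {summary}</s>"

pairs = [("длинный текст ...", "краткое содержание ...")]  # your data here
with open("train.txt", "w", encoding="utf-8") as f:
    for text, summary in pairs:
        f.write(make_sample(text, summary) + "\n")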

ruGPT3XL_generation example does not work

!DS_BUILD_CPU_ADAM=1 DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.3.7

Collecting deepspeed==0.3.7
  Downloading https://files.pythonhosted.org/packages/1f/f6/4de24b5790621e9eb787b7e4d90a57075ebbb85e81100a0dc8c50fdba8ba/deepspeed-0.3.7.tar.gz (258kB)
     |████████████████████████████████| 266kB 7.5MB/s 
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

I tried it in Colab. Any ideas how to fix it?

Generate_text_with_RuGPTs_HF also does not work:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

ImportError                               Traceback (most recent call last)
<ipython-input-5-4bb89d36a3dc> in <module>()
----> 1 from transformers import GPT2LMHeadModel, GPT2Tokenizer

2 frames
/usr/local/lib/python3.7/dist-packages/transformers/__init__.py in <module>()
    624 
    625     # Trainer
--> 626     from .trainer import Trainer
    627     from .trainer_pt_utils import torch_distributed_zero_first
    628 else:

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in <module>()
     67     TrainerState,
     68 )
---> 69 from .trainer_pt_utils import (
     70     DistributedTensorGatherer,
     71     SequentialDistributedSampler,

/usr/local/lib/python3.7/dist-packages/transformers/trainer_pt_utils.py in <module>()
     38     SAVE_STATE_WARNING = ""
     39 else:
---> 40     from torch.optim.lr_scheduler import SAVE_STATE_WARNING
     41 
     42 logger = logging.get_logger(__name__)

ImportError: cannot import name 'SAVE_STATE_WARNING' from 'torch.optim.lr_scheduler' (/usr/local/lib/python3.7/dist-packages/torch/optim/lr_scheduler.py)

What GPU is needed to finetune the Large version?

I have a 16 GB GPU and get a CUDA out of memory error (for batch size = 1!):

RuntimeError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 14.76 GiB total capacity; 13.25 GiB already allocated; 21.44 MiB free; 13.84 GiB reserved in total by PyTorch)

Is this memory really not enough to train the large version? Maybe there are some tips to reduce memory usage during pretraining? I am using the following parameters:

    --per_gpu_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --overwrite_cache \
    --num_train_epochs 2 \
    --save_steps 1000 \
    --block_size 256 \
    --fp16
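
For what it's worth, the knobs in the list above that trade compute for memory are gradient accumulation and block size: accumulation raises the effective batch size at no extra memory cost, while a smaller block_size bounds activation memory. A hedged variant of the same parameter list (the values are illustrative):

    --per_gpu_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --overwrite_cache \
    --num_train_epochs 2 \
    --save_steps 1000 \
    --block_size 128 \
    --fp16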

CPU offload mode for GPT3XL

Good afternoon.
Recently, while trying to fine-tune the largest model, GPT3XL, I ran into an out-of-memory error. I tried enabling cpu_offload mode in the deepspeed config and failed: an error is thrown, see the stack trace at the link:
https://gist.github.com/exelents/dd64ddd745bfa732a809a6b3e9af678d
RuntimeError: expected input to be on cuda
The question is: what needs to be done for this model to run in CPU offload mode, and is it possible at all?

How to solve this error?

FileNotFoundError: [Errno 2] No such file or directory: 'c:\users\admin\anaconda3\Lib\venv\scripts\nt\python.exe'

GPT3XL: generation doesn't work

I installed everything locally exactly as in this notebook: https://github.com/sberbank-ai/ru-gpts/blob/master/examples/ruGPT3XL_generation.ipynb

~$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/antoly/3env/lib/python3.6/site-packages/torch']
torch version .................... 1.7.1+cu101
torch cuda version ............... 10.1
nvcc version ..................... 10.1
deepspeed install path ........... ['/home/antoly/3env/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.3.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.7, cuda 10.1

The model loads into memory fine, but crashes when the model is called:

gpt("Кто был президентом США в 2020? ").logits

/home/antoly/3env/lib/python3.6/site-packages/deepspeed/ops/sparse_attention/matmul.py:272: UserWarning: This overload of nonzero is deprecated:
        nonzero()
Consider using one of the following signatures instead:
        nonzero(*, bool as_tuple) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
  nnz = layout.nonzero()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "ru-gpts/src/xl_wrapper.py", line 281, in __call__
    lm_logits = self.model(tokens, position_ids, attention_mask)
  File "/home/antoly/3env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "ru-gpts/src/fp16/fp16.py", line 72, in forward
    return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
  File "/home/antoly/3env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "ru-gpts/src/model/gpt3_modeling.py", line 108, in forward
    transformer_output = self.transformer(embeddings, attention_mask)
  File "/home/antoly/3env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "ru-gpts/src/mpu/transformer.py", line 449, in forward
    hidden_states = layer(hidden_states, attention_mask)
  File "/home/antoly/3env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "ru-gpts/src/mpu/transformer.py", line 301, in forward
    attention_output = self.attention(layernorm_output, ltor_mask)
  File "/home/antoly/3env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "ru-gpts/src/mpu/transformer.py", line 131, in forward
    attn_mask=ltor_mask)
  File "/home/antoly/3env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/antoly/3env/lib/python3.6/site-packages/deepspeed/ops/sparse_attention/sparse_self_attention.py", line 130, in forward
    attn_output_weights = sparse_dot_sdd_nt(query, key)
  File "/home/antoly/3env/lib/python3.6/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 746, in __call__
    time_db)
  File "/home/antoly/3env/lib/python3.6/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 550, in forward
    c_time)
  File "/home/antoly/3env/lib/python3.6/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 228, in _sdd_matmul
    bench=bench)
  File "/home/antoly/3env/lib/python3.6/site-packages/triton/kernel.py", line 86, in __call__
    torch.ops.triton.launch_kernel(self.op_id, device, params)
RuntimeError: CUDA: Error- invalid ptx

My only suspicion is that I have two CUDA versions installed, 9.2 and 10.1. I configured all paths to point to 10.1, but perhaps triton still looks at 9.2. Have you run into this error?

Maybe I need to install cuDNN? I don't have it installed for 10.1.

Is it ruGPT3Large or ruGPT2Large?

Why is it referred to by Sberbank as ruGPT-3 everywhere, including the readme file, while the model itself is named gpt2_large_bbpe_v50?
Is that correct? Should we be using gpt2_large_bbpe_v50? Referring to it as both gpt2 and gpt3 is confusing.

No bos token

Is it possible to generate text using an empty prompt or calculate the probability of the first token?
There is no bos_token_id defined, therefore I do not see a natural way to do it.
See https://colab.research.google.com/drive/1JvOSnKU4Mn1VtHoYXQ3rDXLRbOWwtRds#scrollTo=BjHAdxfyYw13 for an example.

If I add <s> as the first token, the probability distribution for the first word is weird (see below), which means it does not solve the problem.

import torch
from torch.nn import Softmax
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Assumption: tokenizer and model are loaded as in the readme example.
tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/rugpt3large_based_on_gpt2")
model = GPT2LMHeadModel.from_pretrained("sberbank-ai/rugpt3large_based_on_gpt2")

softmax = Softmax(dim=1)
text = "В Москве стоит довольно хорошая погода."
with torch.no_grad():
    encoded_input = tokenizer(text, return_tensors='pt', add_special_tokens=True)["input_ids"]
    encoded_input = torch.cat([torch.LongTensor([[tokenizer.bos_token_id]]), encoded_input, torch.LongTensor([[tokenizer.eos_token_id]])], dim=1)
    print(*[tokenizer.decode(x) for x in encoded_input[0]])
    outputs = model(encoded_input)
    probs = softmax(outputs.logits[0])

values, indexes = torch.topk(probs, dim=1, k=20)
values, indexes = values.numpy(), indexes.numpy()

for i, (curr_probs, curr_indexes) in enumerate(zip(values, indexes)):
    print(i)
    for index, prob in zip(curr_indexes, curr_probs):
        print(f"{tokenizer.decode([index]).rstrip()}:{prob:.2f}", end=" ")
    print("")

The output is

0
:0.05 .:0.05 :0.05 ;:0.03 ,:0.02  in:0.02 :0.02  and:0.01  the:0.01 {:0.01 \:0.01 [:0.01  a:0.01 ':0.01  of:0.01  \:0.01  to:0.01 :0.01 ::0.01  as:0.01 
1
.:0.02  этом:0.02 нимание:0.01 первые:0.01  общем:0.01 спом:0.01  конце:0.01 роде:0.01 зя:0.01  том:0.01 месте:0.01  этой:0.01  России:0.01 нутри:0.01 ход:0.01 торая:0.00  связи:0.00 сю:0.00 &:0.00  начале:0.00 
2
,:0.08  в:0.05  на:0.04 .:0.02  и:0.02  есть:0.01  с:0.01 ::0.01  -:0.01  не:0.01  у:0.01  уже:0.01  был:0.01  я:0.01 &:0.01  по:0.01  было:0.01  (:0.01  за:0.00  была:0.00 
3
 памятник:0.09  на:0.03 ,:0.02  в:0.02  жара:0.01  стол:0.01  не:0.01  такая:0.01  очередь:0.01  только:0.01  по:0.01  такой:0.01  один:0.00 .:0.00  очень:0.00  новый:0.00  &:0.00  гроб:0.00  прекрасная:0.00  тишина:0.00 
4
 много:0.11  высокая:0.06  большой:0.04  большая:0.03  высокий:0.02  большое:0.02 -:0.01  сильный:0.01  хорошая:0.01  странная:0.01  высокое:0.01  низкая:0.01  прилич:0.01  прохлад:0.01  дорого:0.01  внуш:0.01  странное:0.01  теплая:0.01  сильное:0.01  сложная:0.01 
5
 погода:0.22 ,:0.07  гостиница:0.03  мебель:0.02  церковь:0.01  и:0.01  выставка:0.01  (:0.01  тишина:0.01  стол:0.01  цена:0.01  квартира:0.01  традиция:0.01  статуя:0.01  архитек:0.01  картина:0.01  &:0.01 .:0.01  очередь:0.01  русская:0.00 
6
,:0.46 .:0.31  и:0.05 ::0.03  для:0.02  -:0.01 ;:0.01  (:0.01 &:0.01  —:0.01  в:0.01  с:0.01  –:0.00 ...:0.00  &:0.00  на:0.00 :0.00 !:0.00 …:0.00 :0.00 
7
:0.24 :0.08  В:0.03  Но:0.03  И:0.02 :0.02  А:0.02  Я:0.01  На:0.01  Это:0.01 &:0.01  С:0.01  Не:0.01  У:0.01  По:0.01 В:0.01  Так:0.01  Как:0.01  Если:0.00  Он:0.00 
8
В:0.06 .:0.06 ,:0.05 в:0.01 ::0.01 С:0.01 &:0.01 :0.01 (:0.01 =:0.01  в:0.01  В:0.01  и:0.01 По:0.01 О:0.01 ++:0.01 Но:0.01 пол:0.00 Из:0.00 Пол:0.00 

Truncated words in ruGPT3Medium output

Hi! I use the model for a seq2seq task, but unfortunately words are cut off at the end of the output and sentences are left unfinished. How can I make the model finish its sentences and not cut off words? (Increasing the maximum length does not fix this.)
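
A model-independent workaround is to generate slightly past the target length and trim the decoded string back to the last sentence-ending punctuation mark; a minimal sketch in Python:

def trim_to_sentence(text: str) -> str:
    """Cut generated text back to the last complete sentence, if any."""
    cut = max(text.rfind(ch) for ch in ".!?…")
    return text[: cut + 1] if cut != -1 else text

print(trim_to_sentence("Привет. Это обрезанное предложе"))  # -> "Привет."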

Is it possible to train models in Google Colab?

I've tried training both ruGPT3Large and ruGPT2Large in Google Colab using a P100 as the GPU. However, CUDA runs out of memory at the very beginning of training. Is it possible to use Google Colab for training these models, and if yes, how can it be done? A working example would be very helpful.

CUDA out of memory

This is really quite raw. I'm not even talking about the fact that it can't be run at all on this Colab.

You didn't even update the required libraries for the Colab; one has to manually install an ancient torch...

It's a pity to have such a negative experience; I was hoping for something better.

Pretraining script can't find train data.

I'm running this code

!python3 pretrain_transformers.py \
    --output_dir ="/content/output" \
    --model_type=gpt2 \
    --model_name_or_path=sberbank-ai/rugpt2large \
    --do_train \
    --train_data_file=./dataset/train/train.txt \
    --do_eval \
    --eval_data_file=./dataset/validation/validation.txt \
    --fp16

And I've got an error

10/26/2020 19:24:43 - INFO - __main__ - Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=1000000000000, cache_dir=None, config_name=None, device=device(type='cuda'), do_eval=True, do_train=True, eval_all_checkpoints=False, eval_data_file='./dataset/validation/validation.txt', evaluate_during_training=False, fp16=True, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=5e-05, line_by_line=False, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=-1, mlm=False, mlm_probability=0.15, model_name_or_path='sberbank-ai/rugpt2large', model_type='gpt2', n_gpu=1, no_cuda=False, num_train_epochs=1.0, output_dir='=/content/output', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=4, save_steps=500, save_total_limit=None, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name=None, train_data_file='./dataset/train/train.txt', warmup_steps=0, weight_decay=0.01)
10/26/2020 19:24:43 - INFO - __main__ - Creating features from dataset file at ./dataset/train
10/26/2020 19:24:53 - INFO - __main__ - Saving features into cached file ./dataset/train/gpt2_cached_lm_1000000000000_train.txt
Traceback (most recent call last):
  File "pretrain_transformers.py", line 782, in <module>
    main()
  File "pretrain_transformers.py", line 731, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "pretrain_transformers.py", line 212, in train
    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/sampler.py", line 96, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

Anyway, !cat ./dataset/train/train.txt runs properly.
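
Two details in the log above stand out: output_dir was parsed as '=/content/output' because of the space before the equals sign, and block_size fell back to its huge default of 1000000000000, which can leave the cached dataset with zero examples. A hedged, corrected invocation:

!python3 pretrain_transformers.py \
    --output_dir="/content/output" \
    --model_type=gpt2 \
    --model_name_or_path=sberbank-ai/rugpt2large \
    --do_train \
    --train_data_file=./dataset/train/train.txt \
    --do_eval \
    --eval_data_file=./dataset/validation/validation.txt \
    --block_size=1024 \
    --fp16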

How to train for NER

I wanted to train the small model for NER. I made a training set (300,000 samples) with samples in the following format:

...
<s>sent: Apple подавала в суд против Samsung относительно технологии разблокирования устройства.
ent: Apple
cat: IMW</s>
<s>sent: Корейцы же пытались взять верх, подав в суд против Apple относительно некоторых беспроводных технологий.
ent: Apple
cat: IMW</s>
...

I also made a validation set in the same format (I didn't quite understand why the training script needs it).

As I understand it, the model should learn the format, i.e. I give it an unfinished sample as input and it completes it.
The results are as follows:

out = model.generate(input_ids.cuda(), max_length=50, repetition_penalty=5.0, do_sample=True, top_k=5, top_p=0.95, temperature=1.5)

<s>sent: Так, в описании принципа работы технологии, указанного в патенте, есть существенная разница между тем, что предлагает Apple, и тем, что предлагает Samsung.
ent: 
 В патентном соглашении говорится следующее Каждый

 <s>sent: Это конфликт между разработчиками/программистами и самой Microsoft.
ent:  
.Microsoft\ServicesWorldNetCatcherToolboxProviderReceive

<s>sent: Microsoft пока ещ не готова создать новый API, и вместо этого она свалила на разработчиков Windows Presentation Foundation, Silverlight, и XNA.
ent:  
/Стизир/.

<s>sent: Однако графические процессоры за прошедшие годы стали куда мощнее и взаимодействовать с ними стало куда удобнее, но графический движок Windows просто не умел с ними работать.
ent:
cat-1b

... i.e., it seems the model did not learn the format and does not complete the ending. Is that right?
So, do I need a larger training set?

Examples of input for training.

Is there any script or function for preprocessing the text data?
Is it okay to use a train file that looks like:
"abc"\n "def"\n "ghi"\n
Or should it be something like
{"text":"abc"}\n {"text":"def"}\n {"text":"ghi"}\n

So, can it be raw text with "\n", or should I convert it into jsonl with only one field, "text"? I've seen that "We support three file formats.." but can't find the examples or preprocessors.

Thanks for the help!
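
In case the jsonl variant is what the scripts expect, here is a hedged sketch of converting a raw text file into jsonl records with a single "text" field:

import json

# Convert raw lines into jsonl records with a single "text" field.
with open("train.txt", encoding="utf-8") as src, \
     open("train.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.strip()
        if line:
            dst.write(json.dumps({"text": line}, ensure_ascii=False) + "\n")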

Experimenting with finetuning GPT2Large on Colab's V100

Welp, I've got it to fine-tune the model, but something seems off. When trying to generate anything with the fine-tuned model, I get an error about the probability being zero, negative, or infinite. It seems to happen because of block_size < 1024, or because of the Apex O3 optimization level. Is there anything I can do to fix this? It seems like I'm so close to actually getting it to work, but every time something goes wrong :/

Config I use for training:

!cd ru-gpts && python pretrain_transformers.py \
    --output_dir=../checkpointss \
    --model_type=gpt2 \
    --model_name_or_path=../gpt2_large_bbpe_v50 \
    --do_train \
    --train_data_file=/content/dataset.txt \
    --fp16 \
    --fp16_opt_level O3 \
    --per_gpu_train_batch_size 1 \
    --num_train_epochs 2 \
    --block_size=768 \
    --overwrite_output_dir

Considering that others successfully trained Large and even XL on Colab GPUs, I think it is actually possible, and the quality difference compared with GPT-3 Small, which is offered for fine-tuning, is drastic.

The download quota for these files has been exceeded

Thank you so much for the amazing job. It's a great pleasure to see Sber's commitment to open AI research.

May I ask you to rehost the models using a friendlier method? I get the message “The download quota for these files has been exceeded”. A direct HTTP URL or a torrent magnet link is better for sharing large files.

Pretraining of ruGPT3Large issues

I am totally confused about the details of GPT3 pretraining. First of all, the file pretrain_ruGPT3Large.sh tries to run a nonexistent Python script, pretrain_gpt2.py. However, in the readme pretrain_gpt2.py is replaced with pretrain_megatron.py.

Secondly, there are some strange things about GPT3 checkpointing. The pretrain_megatron.py script treats model checkpoints as directories with .pt dumps; meanwhile, generate_ruGPT3Large.py takes as input a directory with model.bin, vocab.json, etc. Also, a directory of this format (and not a .pt file) is what can be downloaded from Google Drive.

So, how should I finetune ruGPT3Large? It is obvious that just calling pretrain_megatron.py is not the correct way - at least because it starts from random weights due to the mismatch of checkpoint files.

Training-test data contamination in the essay example

I was excited to see outstanding perplexities (8 and 3) for essay generation. They were so good that I decided to check for data leakage. Unfortunately, your training set contains all of the validation set data. The resulting model could be overfit if you trained for more than one epoch (it's impossible to know without a proper validation set).

Why is it impossible to finetune GPT-2 Large on V100?..

I don't quite understand the reason, honestly. Colab provides a V100 if you're a premium user, and I tried to run GPT-2 Large training (with fp16 and batch size 1), but it still runs out of memory. The original GPT-2 774M and even 1.5B fine-tuned just fine. What exactly is different in the Russian model?

Sensitive data in the training corpus

Hello.
Could you tell me, please, whether the training corpus was cleaned of sensitive data in any way? Or were any methods used during training that prevent extraction of data unintentionally memorized by the model, for example differential privacy methods?
I just wouldn't want any user to receive a generated text containing some person's personal data.
Thank you.

Unable to run example code

Unable to run inference. I'm trying to run the example code from the readme. This error appears:

user@2ae38096b475:/ru-gpts$ python3 gpt_bot.py
Traceback (most recent call last):
  File "gpt_bot.py", line 4, in <module>
    from generation_wrapper import RuGPT3XL
  File "gw/generation_wrapper.py", line 13, in <module>
    import mpu
  File "/ru-gpts/mpu/__init__.py", line 50, in <module>
    from .transformer import BertParallelSelfAttention
  File "/ru-gpts/mpu/transformer.py", line 34, in <module>
    from torch_blocksparse_cpp_utils import make_layout
ModuleNotFoundError: No module named 'torch_blocksparse_cpp_utils'

Deepspeed report:

user@2ae38096b475:/ru-gpts$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/user/miniconda/lib/python3.8/site-packages/torch']
torch version .................... 1.5.0
torch cuda version ............... 10.2
nvcc version ..................... 10.2
deepspeed install path ........... ['/home/user/miniconda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.3.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.5, cuda 10.2

Ubuntu 18.02 in docker

Can't load a model from path.

I use a path to a downloaded model instead of its name, to avoid downloading. Unfortunately, it doesn't work for me.

/content/generate_transformers.py in predict(model_name, start, length_, temperature_)
191 raise KeyError("the model {} you specified is not supported. You are welcome to add it and open a PR :)")
192
--> 193 tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
194 model = model_class.from_pretrained(args.model_name_or_path)
195 model.to(args.device)

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in from_pretrained(cls, *inputs, **kwargs)
391
392 """
--> 393 return cls._from_pretrained(*inputs, **kwargs)
394
395 @classmethod

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
542 # Instantiate tokenizer.
543 try:
--> 544 tokenizer = cls(*init_inputs, **init_kwargs)
545 except OSError:
546 raise OSError(

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_gpt2.py in init(self, vocab_file, merges_file, errors, unk_token, bos_token, eos_token, **kwargs)
147 **kwargs
148 ):
--> 149 super().init(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)
150 self.max_len_single_sentence = (
151 self.max_len

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in init(self, max_len, **kwargs)
335 assert isinstance(value, (list, tuple)) and all(isinstance(t, str) for t in value)
336 else:
--> 337 assert isinstance(value, str)
338 setattr(self, key, value)
339

The path looks like /content/gdrive/MyDrive/rugpt3small_based_on_gpt2/gpt3_small_ppl_21_8/

It also doesn't run if I use a path to the .tar file.
So, are there any solutions for using downloaded models? Thanks.

Docker container request

Hello!

It would be very helpful if you build and release docker containers with the models.

Thank you in advance.

Dataset collaboration

Hello,

Thank you for the high quality pre-trained model, it's super easy to deploy and use.

As you may know, there is an ongoing community-driven effort to replicate GPT-3 with 175B parameters. A part of the project is building the dataset. Version 1 of the dataset is focused on English and is almost ready. The next goal is a fully multilingual, 10 TiB text dataset.

https://github.com/EleutherAI/The-Pile

Would you mind sharing your dataset, so it can become part of the project?

Integration with wandb

Is there a possibility to add an integration with wandb? Or maybe it already works, but I can't figure out how. Please help.

AssertionError raised when generating with GPT3Large

Hi!
I used the pretrained model from GDrive. An AssertionError gets raised unless special_tokens_map.json is removed from the pretrained model dir. The assertion expects bos_token to be a string instead of a dict (from that json).

bash ./scripts/generate_ruGPT3Large.sh
2020-11-04 22:38:45.094336: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils -   Model name '/storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming '/storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2' is a path, a model identifier, or url to a directory containing tokenizer files.
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils -   Didn't find file /storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2/added_tokens.json. We won't load it.
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils -   loading file /storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2/vocab.json
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils -   loading file /storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2/merges.txt
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils -   loading file None
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils -   loading file /storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2/special_tokens_map.json
11/04/2020 22:38:49 - INFO - transformers.tokenization_utils -   loading file /storage/Batyr/ru-gpts/checkpoints/rugpt3large_based_on_gpt2/tokenizer_config.json
Traceback (most recent call last):
  File "generate_transformers.py", line 269, in <module>
    main()
  File "generate_transformers.py", line 203, in main
    tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
  File "/storage/Batyr/ru-gpts/gpt_env/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 393, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/storage/Batyr/ru-gpts/gpt_env/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 545, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/storage/Batyr/ru-gpts/gpt_env/lib/python3.6/site-packages/transformers/tokenization_gpt2.py", line 149, in __init__
    super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)
  File "/storage/Batyr/ru-gpts/gpt_env/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 337, in __init__
    assert isinstance(value, str), f'key: {key}, value: {value}'
AssertionError: key: bos_token, value: {'content': '<|endoftext|>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}

Tokenizer AssertionError while loading data

Hello!
I was trying to finetune your model on Google Colab, but I stumbled upon a "curious" error:

Traceback (most recent call last):
  File "pretrain_megatron.py", line 714, in <module>
    main()
  File "pretrain_megatron.py", line 659, in main
    args.eod_token = get_train_val_test_data(args)
  File "pretrain_megatron.py", line 594, in get_train_val_test_data
    args)
  File "/content/ru-gpts/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/content/ru-gpts/configure_data.py", line 170, in make_loaders
    train, tokenizer = data_utils.make_dataset(**data_set_args)
  File "/content/ru-gpts/data_utils/__init__.py", line 101, in make_dataset
    pad_token, character_converage, **kwargs)
  File "/content/ru-gpts/data_utils/tokenization.py", line 43, in make_tokenizer
    return GPT2BPETokenizer(model_path=model_path, **kwargs)
  File "/content/ru-gpts/data_utils/tokenization.py", line 823, in __init__
    self.text_tokenizer = GPT2Tokenizer.from_pretrained(model_path, cache_dir=cache_dir)
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py", line 393, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py", line 544, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_gpt2.py", line 149, in __init__
    super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py", line 337, in __init__
    assert isinstance(value, str)
AssertionError

Here are my arguments:

       --train-data /content/ru-gpts/data/train1.jsonl \
       --valid-data /content/ru-gpts/data/valid.jsonl \
       --test-data /content/ru-gpts/data/valid.jsonl \
       --save /content/ru-gpts/checkpoints/checkpoints_${now}_${host} \
       --load /content/ru-gpts/gpt3model \
       --save-interval 500 \
       --eval-interval 500 \
       --log-interval 100 \
       --model-parallel-size ${MP_SIZE} \
       --num-layers 24 \
       --hidden-size 1536 \
       --num-attention-heads 16 \
       --seq-length 2048 \
       --max-position-embeddings 2048 \
       --vocab-size 50257 \
       --batch-size 1 \
       --train-iters 200000 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --lr-decay-style cosine \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --fp16 \
       --lazy-loader \
       --checkpoint-activations \
       --loose-json \
       --text-key text \
       --tokenizer-path /content/ru-gpts/gpt3model \
       --tokenizer-type GPT2BPETokenizer \
       --finetune

I got your model from the archive on GDrive.
What's wrong with the tokenizer?
It's GPT3Large.

reason for relicense to MIT?

If I understand correctly, according to the LICENSE file the project is offered under the MIT license. However, most of the code is inherited from the GPT2 project produced by Nvidia under the Apache license.

Let me remind you of two rules of the Apache license: 1. a license change requires agreement from all previous developers of the project; 2. making changes to files under the Apache license requires an indication of what exactly was changed and for what purpose (a notice must be added stating that changes have been made to that file).

Is it possible to know why this relicensing attempt was made?

Training and usage

As I understand it, train.txt is needed for training?
If so, how should the data be written into it?

An example of how I try to write it into train.txt:

<s>User1:Привет, как дела?
User2:Привет, у меня всё хорошо.</s>

There is no documentation, so I really have no idea; I did it roughly like the train file that gets downloaded, and even that looks randomly written.

But when I try out the trained model, I get:

User1:Привет, как дела?
User2:Привет, у меня всё хорошо.</s>
<s>User1:А как тебя зовут? Я Алёша, надеюсь

How do I remove everything extra? And how can I make it not cut off sentences?
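
One hedged option, assuming the fine-tuned tokenizer knows the <s> and </s> markers as tokens: pass the id of </s> as eos_token_id so generation stops there, then strip the markers from the decoded string:

# Stop generation at </s> and strip the markers (a sketch).
eos_id = tokenizer.convert_tokens_to_ids("</s>")
out = model.generate(input_ids, max_length=50, eos_token_id=eos_id)
text = tokenizer.decode(out[0])
text = text.split("</s>")[0].replace("<s>", "").strip()
print(text)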

Error when trying to fine-tune GPT3XL

When trying to run fine-tuning on a small debug dataset, I get an error during the forward pass. What could be the problem?

Traceback (most recent call last):
  File "../pretrain_gpt3.py", line 832, in <module>
    main()
  File "../pretrain_gpt3.py", line 812, in main
    tokenizer)
  File "../pretrain_gpt3.py", line 472, in train
    args, timers, tokenizer, iteration, tb_writer)
  File "../pretrain_gpt3.py", line 406, in train_step
    lm_loss = forward_step(sample, model, args, timers, tokenizer, iteration, tb_writer)
  File "../pretrain_gpt3.py", line 298, in forward_step
    output = model(tokens, position_ids, attention_mask)
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/DeepSpeed-triton2/deepspeed/runtime/engine.py", line 972, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/data/ipynb/ru-gpts/src/model/distributed.py", line 79, in forward
    return self.module(*inputs, **kwargs)
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/data/ipynb/ru-gpts/src/fp16/fp16.py", line 72, in forward
    return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/data/ipynb/ru-gpts/src/model/gpt3_modeling.py", line 108, in forward
    transformer_output = self.transformer(embeddings, attention_mask)
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/data/ipynb/ru-gpts/src/mpu/transformer.py", line 445, in forward
    hidden_states, attention_mask)
  File "/export/DeepSpeed-triton2/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 682, in checkpoint
    CheckpointFunction.apply(function, all_outputs, *args)
  File "/export/DeepSpeed-triton2/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 486, in forward
    outputs = run_function(*inputs_cuda)
  File "/export/data/ipynb/ru-gpts/src/mpu/transformer.py", line 434, in custom_forward
    x_ = layer(x_, inputs[1])
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/data/ipynb/ru-gpts/src/mpu/transformer.py", line 301, in forward
    attention_output = self.attention(layernorm_output, ltor_mask)
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/data/ipynb/ru-gpts/src/mpu/transformer.py", line 116, in forward
    mixed_x_layer = self.query_key_value(hidden_states)
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/export/data/ipynb/ru-gpts/src/mpu/layers.py", line 243, in forward
    output_parallel = F.linear(input_parallel, self.weight, self.bias)
  File "/home/fellow/.virtualenvs/rugpt37/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Generating with deepspeed checkpoints

Hi!
Following Finetune_and_generate_RuGPTs_deepspeed_megatron.ipynb, I've finetuned a model using deepspeed. However, I can't use the checkpoint to generate a response, since generate_samples.py throws a KeyError:
A metadata file exists but unable to load model from checkpoint /iter_0030000/mp_rank_00/model_optim_rng.pt, exiting
The checkpoint's structure also looks different from a regular megatron checkpoint. What can I use to run inference with a deepspeed checkpoint?

The letter Н in the output

In generate_ruGPT3Large.sh and generate_transformers.py, every output starts with Н. It seems to me this was not intended.

Context >>> на словах ты лев толстой
10/23/2020 14:45:27 - WARNING - transformers.modeling_utils - Setting pad_token_id to 50256 (first eos_token_id) to generate sequence
ruGPT:
Hна словах ты лев толстой,
А на деле — как все мы.
А на самом деле
Context >>> на словах ты лев толстой
10/23/2020 14:45:33 - WARNING - transformers.modeling_utils - Setting pad_token_id to 50256 (first eos_token_id) to generate sequence
ruGPT:
Hна словах ты лев толстой и пушистый! Я не могу тебя отпустить!

- Я люблю тебя, - прошептал
Context >>> Hна словах ты лев толстой и пушистый! Я не могу тебя отпустить!
10/23/2020 14:45:44 - WARNING - transformers.modeling_utils - Setting pad_token_id to 50256 (first eos_token_id) to generate sequence
ruGPT:
HHна словах ты лев толстой и пушистый! Я не могу тебя отпустить! Ты будешь моей! Я хочу, чтоб мы были счастливы, и не важно, сколько теб
Context >>> privet
10/23/2020 14:46:00 - WARNING - transformers.modeling_utils - Setting pad_token_id to 50256 (first eos_token_id) to generate sequence
ruGPT:
Hprivet-rodina.ru/...✂ http://rodinavet.ru/...✂ http://r
Context >>> лол
10/23/2020 14:46:08 - WARNING - transformers.modeling_utils - Setting pad_token_id to 50256 (first eos_token_id) to generate sequence
ruGPT:
Hлол, как я уже говорил, была в том числе и частью «памятника» —
Context >>> ыеврвыкр
10/23/2020 14:46:21 - WARNING - transformers.modeling_utils - Setting pad_token_id to 50256 (first eos_token_id) to generate sequence
ruGPT:
Hыеврвыкр
Context >>> Hпривет
10/23/2020 14:50:47 - WARNING - transformers.modeling_utils - Setting pad_token_id to 50256 (first eos_token_id) to generate sequence
ruGPT:
HHпривет!

Can't process data when finetuning GPT3-XL

I tried to finetune GPT3-XL via the deepspeed_gpt3_xl.sh script. I downloaded and prepared the data as in Finetune_and_generate_RuGPTs_deepspeed_megatron.ipynb, and also added the argument --tokenizer-path sberbank-ai/rugpt3xl to deepspeed_gpt3_xl.sh.

But running the script throws an error:

USE_DEEPSPEED=1 mpirun --np 1 python pretrain_gpt3.py --train-data-path train.list --test-data-path valid.list --logging-dir=logs/ --save model --save-interval 1000 --model-parallel-size 1 --num-layers 24 --hidden-size 2048 --num-attention-heads 16 --batch-size 1 --seq-length 2048 --max-position-embeddings 2048 --train-iters 5 --resume-dataloader --distributed-backend nccl --lr 0.0002 --lr-decay-style cosine --weight-decay 1e-2 --warmup .01 --log-interval 100 --fp16 --checkpoint-activations --deepspeed-activation-checkpointing --sparse-mode alternating --deepspeed --deepspeed_config src/deepspeed_config/gpt3_xl_sparse_2048.json --tokenizer-path sberbank-ai/rugpt3xl
2021-02-20 11:36:26.334556: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
using world size: 1 and model-parallel size: 1

using dynamic loss scaling
initializing model parallel with size 1
[2021-02-20 11:36:33,987] [INFO] [checkpointing.py:629:configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
Pretrain GPT3 model
arguments:
attention_dropout ............ 0.1
num_attention_heads .......... 16
hidden_size .................. 2048
intermediate_size ............ None
num_layers ................... 24
layernorm_epsilon ............ 1e-05
hidden_dropout ............... 0.1
max_position_embeddings ...... 2048
vocab_size ................... 30522
deep_init .................... False
make_vocab_size_divisible_by . 8
cpu_optimizer ................ False
cpu_torch_adam ............... False
sparse_mode .................. alternating
fp16 ......................... True
fp32_embedding ............... False
fp32_layernorm ............... False
fp32_tokentypes .............. False
fp32_allreduce ............... False
hysteresis ................... 2
loss_scale ................... None
loss_scale_window ............ 1000
min_scale .................... 1
batch_size ................... 1
weight_decay ................. 0.01
checkpoint_activations ....... True
checkpoint_num_layers ........ 1
deepspeed_activation_checkpointing True
clip_grad .................... 1.0
train_iters .................. 5
log_interval ................. 100
logging_dir .................. logs/
exit_interval ................ None
seed ......................... 1234
reset_position_ids ........... False
reset_attention_mask ......... False
lr_decay_iters ............... None
lr_decay_style ............... cosine
lr ........................... 0.0002
min_lr ....................... 1e-06
warmup ....................... 0.01
save ......................... model
save_interval ................ 1000
no_save_optim ................ False
no_save_rng .................. False
load ......................... None
no_load_optim ................ False
log_memory ................... False
no_load_rng .................. False
load_huggingface ............. None
export_huggingface ........... None
huggingface_double_pos_embeddings False
load_tag .....................
cache_prefix ................. _
finetune ..................... False
resume_dataloader ............ True
distributed_backend .......... nccl
local_rank ................... 0
eval_batch_size .............. None
eval_iters ................... 100
eval_interval ................ 1000
eval_seq_length .............. None
eval_max_preds_per_seq ....... None
overlapping_eval ............. 32
cloze_eval ................... False
eval_hf ...................... False
load_openai .................. False
temperature .................. 1.0
top_p ........................ 0.0
top_k ........................ 0
out_seq_length ............... 256
tg_token_name ................ token.txt
model_parallel_size .......... 1
shuffle ...................... False
train_data ................... None
use_npy_data_loader .......... False
train_data_path .............. train.list
val_data_path ................
test_data_path ............... valid.list
input_data_sizes_file ........ sizes.txt
delim ........................ ,
text_key ..................... sentence
eval_text_key ................ None
valid_data ................... None
split ........................ 1000,1,1
test_data .................... None
overwrite_cache .............. False
lazy_loader .................. False
loose_json ................... False
presplit_sentences ........... False
num_workers .................. 2
tokenizer_path ............... sberbank-ai/rugpt3xl
cache_dir .................... None
use_tfrecords ................ False
seq_length ................... 2048
max_files_per_process ........ 50000
max_preds_per_seq ............ None
deepspeed .................... True
deepspeed_config ............. src/deepspeed_config/gpt3_xl_sparse_2048.json
deepscale .................... False
deepscale_config ............. None
deepspeed_mpi ................ False
cuda ......................... True
rank ......................... 0
world_size ................... 1
dynamic_loss_scale ........... True
[2021-02-20 11:36:33,987] [INFO] [checkpointing.py:256:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
Load tokenizer from sberbank-ai/rugpt3xl
Load RuGPT3 Dataset from train.list, 50000 files per process
/home/atuthvatullin/environments/albert/lib/python3.6/site-packages/tensorflow/python/autograph/utils/testing.py:21: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
R0/1: Loading dataset train.list
R0/1: Check filelist train.list with root dir
R0/1: Shard [0, 1]
R0/1: Loaded 0/1 files
Traceback (most recent call last):
  File "pretrain_gpt3.py", line 830, in <module>
    main()
  File "pretrain_gpt3.py", line 783, in main
    train_data, val_data, test_data, args.vocab_size, args.eod_token, tokenizer = get_train_val_test_data(args)
  File "pretrain_gpt3.py", line 681, in get_train_val_test_data
    (train_data, val_data, test_data), num_tokens, eod_token, tokenizer = make_gpt3_dataloaders(args)
  File "/home/atuthvatullin/ru-gpts2/src/gpt3_data_loader.py", line 104, in make_gpt3_dataloaders
    train = make_data_loader_(args.train_data_path, train_dataset_args) if args.train_data_path else None
  File "/home/atuthvatullin/ru-gpts2/src/gpt3_data_loader.py", line 93, in make_data_loader_
    file_path=data_path,
  File "/home/atuthvatullin/ru-gpts2/src/dataset_rugpt3.py", line 130, in __init__
    self.examples = np.vstack(examples)
  File "<__array_function__ internals>", line 6, in vstack
  File "/home/atuthvatullin/environments/albert/lib/python3.6/site-packages/numpy/core/shape_base.py", line 283, in vstack
    return _nx.concatenate(arrs, 0)
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: need at least one array to concatenate


Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

  Process name: [[44749,1],0]
  Exit code: 1

A strange error when running the example

Python 3.6

> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
Downloading: 100%|████████████████████████████| 1.57M/1.57M [00:00<00:00, 1.85MB/s]
Downloading: 100%|████████████████████████████| 1.23M/1.23M [00:00<00:00, 1.41MB/s]
Downloading: 100%|████████████████████████████| 2.63G/2.63G [03:49<00:00, 11.5MB/s]
Downloading: 100%|█████████████████████████████████| 653/653 [00:00<00:00, 329kB/s]
Traceback (most recent call last):
  File "gpt_bot.py", line 8, in <module>
    gpt = RuGPT3XL.from_pretrained("sberbank-ai/rugpt3xl", seq_len=512)
  File "gw/generation_wrapper.py", line 179, in from_pretrained
    model = setup_model(weights_path, deepspeed_config_path)
  File "gw/generation_wrapper.py", line 82, in setup_model
    model = get_model(deepspeed_config_path)
  File "gw/generation_wrapper.py", line 71, in get_model
    sparse_mode=sparse_mode)
TypeError: __init__() got an unexpected keyword argument 'deepspeed_sparsity_config'
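
This error usually points to a version mismatch: the installed DeepSpeed (or the checked-out wrapper code) exposes a constructor that no longer accepts the keyword being passed. A quick way to see what the installed build actually accepts (a diagnostic sketch, assuming the sparse-attention module from the setup above is present):

import inspect

import deepspeed
from deepspeed.ops.sparse_attention import SparseSelfAttention

print(deepspeed.__version__)
# Compare the keywords this build accepts with what the wrapper passes.
print(inspect.signature(SparseSelfAttention.__init__))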

eos_token_id is None

The tokenizer has a special token eos_token = '<|endoftext|>'. However, this token seems to be missing from the vocabulary.
Hence, when I try to encode a sequence like so:

tokenizer.encode("привет" + tokenizer.eos_token), I get [960, 577, None], which leads to:

Traceback (most recent call last):
  File "/home/superuser/khovrichev/gpt2bot/run_console_bot.py", line 14, in <module>
    run_bot(**config)
  File "/home/superuser/khovrichev/gpt2bot/gpt2bot/console_bot.py", line 66, in run_bot
    bot_messages = generate_text(prompt, pipeline, **generator_kwargs)
  File "/home/superuser/khovrichev/gpt2bot/gpt2bot/utils.py", line 301, in generate_text
    responses = generator.generate_next(prompt)
  File "/home/superuser/khovrichev/gpt2bot/gpt2bot/utils.py", line 103, in generate_next
    encoded_prompt = self.tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
  File "/home/superuser/khovrichev/ru-gpts/gpt_env/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 919, in encode
    **kwargs,
  File "/home/superuser/khovrichev/ru-gpts/gpt_env/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 1069, in encode_plus
    return_special_tokens_mask=return_special_tokens_mask,
  File "/home/superuser/khovrichev/ru-gpts/gpt_env/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 1463, in prepare_for_model
    encoded_inputs["input_ids"] = torch.tensor([encoded_inputs["input_ids"]])
RuntimeError: Could not infer dtype of NoneType

Why is eos_token present in the tokenizer but absent from the vocab?
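
A common workaround (a sketch using the standard 🤗 tokenizers API, not an official fix from this repo) is to register the token explicitly and resize the model embeddings so the new id becomes valid; the model name below is just an example:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = "sberbank-ai/rugpt3large_based_on_gpt2"  # example model, swap in your own
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# If eos_token maps to no vocabulary id, add it as a real entry ...
if tokenizer.eos_token_id is None:
    tokenizer.add_special_tokens({"eos_token": "<|endoftext|>"})
    model.resize_token_embeddings(len(tokenizer))  # ... and grow the embedding matrix to match

print(tokenizer.encode("привет" + tokenizer.eos_token))  # no more None ids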

deepspeed installation error

Error when installing deepspeed: the extensions for cpu_adam and sparse_attention do not get installed. Installing version 0.3.7 with the given flags produces no errors, but the extensions are still not installed.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/nikita/.local/lib/python3.8/site-packages/torch']
torch version .................... 1.6.0+cu101
torch cuda version ............... 10.1
nvcc version ..................... 10.1
deepspeed install path ........... ['/home/nikita/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.3.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.6, cuda 10.1

When I try to install version 0.3.11, or run the install script from the deepspeed repository with the same flags, it fails. Here is the error log from the install script:

No hostfile exists at /job/hostfile, installing locally
Building deepspeed wheel
DS_BUILD_OPS=0
Install Ops={'cpu_adam': 1, 'fused_adam': False, 'fused_lamb': False, 'sparse_attn': 1, 'transformer': False, 'stochastic_transformer': False, 'utils': False}
version=0.3.11+29fa4b2, git_hash=29fa4b2, git_branch=master
install_requires=['torch>=1.2', 'torchvision>=0.4.0', 'tqdm', 'tensorboardX==1.8', 'ninja', 'numpy', 'triton==0.2.3']
compatible_ops={'cpu_adam': True, 'fused_adam': True, 'fused_lamb': True, 'sparse_attn': True, 'transformer': True, 'stochastic_transformer': True, 'utils': True}
ext_modules=[<setuptools.extension.Extension('deepspeed.ops.adam.cpu_adam_op') at 0x7fcb7be90820>, <setuptools.extension.Extension('deepspeed.ops.sparse_attention.sparse_attn_op') at 0x7fcb7be909d0>]
running bdist_wheel
running build
running build_py
copying deepspeed/git_version_info_installed.py -> build/lib.linux-x86_64-3.8/deepspeed
running egg_info
writing deepspeed.egg-info/PKG-INFO
writing dependency_links to deepspeed.egg-info/dependency_links.txt
writing requirements to deepspeed.egg-info/requires.txt
writing top-level names to deepspeed.egg-info/top_level.txt
reading manifest file 'deepspeed.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '*.cc' under directory 'deepspeed'
warning: no files found matching '*.tr' under directory 'csrc'
warning: no files found matching '*.cc' under directory 'csrc'
writing manifest file 'deepspeed.egg-info/SOURCES.txt'
running build_ext
building 'deepspeed.ops.adam.cpu_adam_op' extension
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -Icsrc/includes -I/usr/local/cuda/include -I/home/nikita/.local/lib/python3.8/site-packages/torch/include -I/home/nikita/.local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/nikita/.local/lib/python3.8/site-packages/torch/include/TH -I/home/nikita/.local/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c csrc/adam/cpu_adam.cpp -o build/temp.linux-x86_64-3.8/csrc/adam/cpu_adam.o -O3 -std=c++14 -L/usr/local/cuda/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=cpu_adam_op -D_GLIBCXX_USE_CXX11_ABI=0
x86_64-linux-gnu-gcc: error: : No such file or directory
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Error on line 155
Fail to install deepspeed

Here is the output of cpufeature.print_features(), in case it helps:

=== CPU FEATURES ===
VendorId : AuthenticAMD
num_virtual_cores : 16
num_physical_cores : 8
num_threads_per_core : 2
num_cpus : 0
cache_line_size : 64
cache_L1_size : 0
cache_L2_size : 0
cache_L3_size : 0
OS_x64 : True
OS_AVX : True
OS_AVX512 : False
MMX : True
x64 : True
ABM : True
RDRAND : True
BMI1 : True
BMI2 : True
ADX : True
PREFETCHWT1 : False
MPX : False
SSE : True
SSE2 : True
SSE3 : True
SSSE3 : True
SSE4.1 : True
SSE4.2 : True
SSE4.a : True
AES : True
SHA : True
AVX : True
XOP : False
FMA3 : True
FMA4 : False
AVX2 : True
AVX512f : False
AVX512pf : False
AVX512er : False
AVX512cd : False
AVX512vl : False
AVX512bw : False
AVX512dq : False
AVX512ifma : False
AVX512vbmi : False

which x86_64-linux-gnu-gcc
/usr/bin/x86_64-linux-gnu-gcc

gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Thanks in advance for your help!
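
The bare "error: :" from gcc means an empty string ended up in the compile command. One plausible cause (an assumption, not a confirmed diagnosis) is a CUDA-related environment variable that is set but empty, so it expands to an empty argument. A quick environment check before rebuilding:

import os
import torch

# DeepSpeed's extension builds need a CUDA toolkit matching the one torch was built with.
print("torch version:", torch.__version__)
print("torch CUDA version:", torch.version.cuda)
print("CUDA_HOME:", repr(os.environ.get("CUDA_HOME")))  # an empty string here can become an empty gcc argument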

Can't run the scripts on torch 1.8.0

I'm trying to run the scripts, but I get the error ImportError: cannot import name 'SAVE_STATE_WARNING' from 'torch.optim.lr_scheduler'. Googling suggests torch 1.4.0 is needed, but it is no longer available, and on 1.8.0 I get this error. Could you update the scripts to work with current versions? Or what should I do in this situation?
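
SAVE_STATE_WARNING was removed from torch in 1.8, while older transformers releases still import it. A community monkey-patch (a workaround sketch, not an official fix) is to restore the attribute before transformers is imported:

# Workaround sketch: re-create the symbol that torch 1.8 removed so that
# older transformers versions can still import it.
import torch.optim.lr_scheduler as lr_scheduler

if not hasattr(lr_scheduler, "SAVE_STATE_WARNING"):
    lr_scheduler.SAVE_STATE_WARNING = ""  # the original was just a warning string

import transformers  # must come after the patch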

ruGPT3Large/ruGPT2Large checkpoints seem too small

Good afternoon!
Are you sure the ruGPT3Large and ruGPT2Large models are supposed to weigh 3 GB?
At that size I was able to run them even on my nvidia 1060, and they took only 4 GB!
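
A back-of-the-envelope check (assuming roughly 760M parameters for the Large models, stored as fp32 — both figures are assumptions) suggests 3 GB is about right:

# Rough checkpoint-size estimate: parameters * 4 bytes per fp32 weight.
params = 760_000_000                # approximate parameter count (assumption)
size_gib = params * 4 / 1024**3
print(f"{size_gib:.2f} GiB")        # ~2.8 GiB, consistent with a ~3 GB file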

Torch takes almost all the memory even on a large GPU

I tried to run the code and pretrained network provided in this notebook,
https://colab.research.google.com/github/sberbank-ai/ru-gpts/blob/master/examples/Finetune_RuGPTs_with_HF.ipynb
but I can't fit more than one sample per batch on my GPU. My data is not very big (90 MB), but training takes forever at batch size 1.

So, every time I run this command

!CUDA_VISIBLE_DEVICES=0 python ru-gpts/pretrain_transformers.py \
    --output_dir=models/essays \
    --model_type=gpt2 \
    --model_name_or_path=sberbank-ai/rugpt3small_based_on_gpt2 \
    --do_train \
    --train_data_file=train.txt \
    --do_eval \
    --eval_data_file=valid.txt \
    --per_gpu_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 5 \
    --block_size 2048 \
    --overwrite_output_dir

with --per_gpu_train_batch_size > 1, I get RuntimeError: CUDA out of memory.
The error shows that >90% of the memory is already allocated to torch, and the remainder is not enough for a larger batch. This happens on GPUs with any amount of memory: 10, 15, 30 GB.

Could you please fix the amount of memory preallocated to PyTorch?
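
Note that PyTorch's caching allocator makes nvidia-smi overstate what is actually in use, and with --block_size 2048 activation memory grows with sequence length, so lowering the block size (or raising --gradient_accumulation_steps instead of the batch size) is usually the cheapest way to fit more data per step. To see real peak usage rather than the cached figure (a sketch using standard torch APIs):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run one forward/backward training step here ...
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 1024**3:.2f} GiB")  # roughly what nvidia-smi shows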

num_samples error?

I'm trying to finetune GPT-2 Large and I get this error:

ValueError: num_samples should be a positive integer value, but got num_samples=0

What is that? Googling suggests the dataset is missing, but I've checked twice and the path is correct. I tried both absolute and relative paths.
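
Besides a wrong path, this error also appears when the file loads but yields zero training examples, e.g. when the training file is shorter than --block_size so not a single full block can be cut (an assumption about the script's TextDataset-style chunking). A quick check:

from transformers import GPT2Tokenizer

# Hypothetical check: compare the token count of the training file to the block size.
tok = GPT2Tokenizer.from_pretrained("sberbank-ai/rugpt3large_based_on_gpt2")  # example tokenizer
with open("train.txt", encoding="utf-8") as f:
    n_tokens = len(tok.encode(f.read()))

block_size = 2048
print(n_tokens, "tokens ->", n_tokens // block_size, "examples")  # 0 examples triggers num_samples=0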
