flagai-open / flagai

FlagAI (Fast LArge-scale General AI models) is a fast, easy-to-use and extensible toolkit for large-scale models.

License: Apache License 2.0

Python 99.89% Shell 0.04% Dockerfile 0.07%

flagai's People

geekplusaa george-han wangguojim dumpmemory luomor-ai sport20speed fade-color yinzh fuyinno4 shenzaimin baai-open-internal bowen92 nicecodeforked benjinglin kunlun-zhu shadowkun xggz luckygirl-lu wpq3142 enockipp mwsssxu zhiyuan-fan zhangfaquan jackyin5918 ledw-2 flagopen lindylin1817 chenyutongthu newsky robotpin coding1018 techthiyanes mbrukman straitrobot so2bin lijun20 shism2 wut0n9 cindyaud jackleiaaaaaa jrcribb marksmayo ssahgal timothyzhang quan-sun jaedukseo lubakabra suryatmodulus sumonst21 kirsireinken99 felipyfuga nanderoo shunxing1234 k-nearest-neighbor richardsonjf macko-pp baai-openplatform mldl rockiesiyuanzhang oneflow-inc maxmax2016 vincentwei2021 fqq11679 guankaisi edisonc72 lhbzx1984 quanquanshixiaogongzhu louisheck tianbuwei winstonwuxingang siyuan-zhou047 tzlby yuchen202 superhero-7 cv-synthesis gloriayy ftgreat gg-big-org undercontroller wolfworld6 ybqu hou-jing leemengtw algorithm-learning-community-for-python zhouao0314 ronghuiju kivvf julienze rickylovefreedom xttd188aa ioannisgkouzionis unesco3187 liuyongs1 sxyseo shania7 amutong yqgao716 flowbywind marscrazy liujuncn

flagai's Issues

Is Apple silicon supported?

Is OPT supported on M1/M1 Pro/M1 Max?

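Unable to reproduce the results of the classical poetry generation task

How much data was used to finetune the model that produced the poetry generation results in the tutorial? The model I finetuned on the sample data does not produce very good results.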

How to export a model after CLIP finetuning

trainer.train(model=model, train_dataset=dataset, collate_fn=cifar10_collate_fn)
Once training has finished, how do I export the model so it can be used for inference?

AltCLIP training for the first stage

Hi, thanks for your great work on AltCLIP.

In the paper, the teacher model's text embedding for the first training stage is extracted from the original CLIP text encoder at the [TOS] token. Does this [TOS] token correspond to the position selected in OpenAI's CLIP: https://github.com/openai/CLIP/blob/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1/clip/model.py#L354
x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)]

However, Figure 1 in the paper uses an [EOS] token. Are [TOS] and [EOS] the same token as in OpenAI's implementation?

Also, does the student model, XLM-R, use the [CLS] token as its text embedding when computing the MSE loss between teacher and student?
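
For reference, the indexing in the linked line selects, for each sequence, the position holding the largest token id. In OpenAI's CLIP vocabulary the end-of-text token has the largest id (49407), so that position is the end-of-text token. The sketch below mirrors that selection for a single row in plain Python; the token ids other than 49406 and 49407 are made up for illustration, and the helper name is hypothetical.

```python
SOT, EOT = 49406, 49407  # start/end-of-text ids in OpenAI CLIP's BPE vocabulary

def eot_position(token_ids):
    """Index of the largest token id, mirroring text.argmax(dim=-1) for one row."""
    return max(range(len(token_ids)), key=lambda i: token_ids[i])

# Hypothetical tokenized prompt, zero-padded to a fixed context length.
tokens = [SOT, 320, 1125, 539, 320, 2368, EOT, 0, 0, 0]
print(eot_position(tokens))  # 6
```

This only shows what the OpenAI code selects; whether the paper's [TOS] refers to the same position is the question for the authors.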

[BUG] Runtime Error from CLUE example

Describe the bug
train_10b_clue.py from ./examples/glm_superglue
afqmc task in CLUE
pytorch setting

task_name = 'afqmc'
trainer = Trainer(env_type="pytorch",
                  batch_size=16,
                  epochs=10,
                  eval_interval=10,
                  load_dir=None,
                  pytorch_device="cuda",
                  save_dir="./glm_superglue_en",
                  save_epoch=1)

model = GLMForSingleTokenCloze.from_pretrain(download_path="/mnt/test_10b_models",
                                             model_name="GLM-large-ch")


tokenizer =  GLMLargeChTokenizer()
train_dataset = SuperGlueDataset(task_name=task_name,
                                 data_dir='./datasets/',
                                 dataset_type='train',
                                 tokenizer=tokenizer,
                                 cloze_eval=True)
valid_dataset = SuperGlueDataset(task_name=task_name,
                                 data_dir='./datasets/',
                                 dataset_type='dev',
                                 tokenizer=tokenizer,
                                 cloze_eval=True)

cl_args = CollateArguments()
cl_args.cloze_eval = True
cl_args.multi_token = False

collate_fn = ConstructSuperglueStrategy(cl_args,
                                        tokenizer,
                                        task_name=task_name)
trainer.train(model,
              train_dataset=train_dataset,
              valid_dataset=valid_dataset,
              collate_fn=collate_fn,
              metric_methods=[["acc", accuracy_metric]])

Tasks

An officially supported task in the examples folder (such as GLUE/Title-generation, ...)
My own task or dataset

To Reproduce

(base) root@deepspeed:~/FlagAI/examples/glm_superglue# python train_10b_clue.py
file cog-pretrain.vocab not exist in ['cog-pretrain.model', 'cog-pratrain.vocab', 'pytorch_model.bin', 'vocab.txt', 'config.json', 'README.md']
{'pad': 50000, 'eos': 50000, 'sep': 50001, 'ENC': 50002, 'MASK': 50003, 'unk': 50004, 'sop': 50006, 'eop': 50007, 'sMASK': 50008, 'gMASK': 50009}
Creating afqmc dataset from file at ./datasets/ (split=train)
Returning 34334 train examples with label dist.: [('0', 23761), ('1', 10573)]
Creating afqmc dataset from file at ./datasets/ (split=dev)
Returning 4316 dev examples with label dist.: [('0', 2978), ('1', 1338)]
Optimizer = Adam
[2022-06-09 10:10:29,530] [INFO] [logger.py:70:log_dist] [Rank -1] loading checkpoints form None
[2022-06-09 10:10:29,530] [INFO] [logger.py:70:log_dist] [Rank -1] working on epoch 0 ...
Traceback (most recent call last):
  File "train_10b_clue.py", line 64, in <module>
    trainer.train(model,
  File "/opt/conda/lib/python3.8/site-packages/flagai-1.0.1-py3.8.egg/flagai/trainer.py", line 460, in train
    lm_loss, skipped_iter, _ = self.train_step(batch,
  File "/opt/conda/lib/python3.8/site-packages/flagai-1.0.1-py3.8.egg/flagai/trainer.py", line 568, in train_step
    step_output = self.forward_step(data, model, mems)
  File "/opt/conda/lib/python3.8/site-packages/flagai-1.0.1-py3.8.egg/flagai/trainer.py", line 635, in forward_step
    model_output = model(**data)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/flagai-1.0.1-py3.8.egg/flagai/model/glm_model.py", line 754, in forward
    model_out = self.model(input_ids,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/flagai-1.0.1-py3.8.egg/flagai/model/glm_model.py", line 453, in forward
    loss = F.cross_entropy(logits_parallel.contiguous().float(),
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2996, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected target size [16, 50048], got [16]
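
The message follows from the shape contract of torch.nn.functional.cross_entropy: for an input of shape (N, C, d1, ...), dimension 1 is treated as the class dimension and a target of class indices must have shape (N, d1, ...). An expected target size of [16, 50048] therefore suggests that the logits reaching the loss were 3-D, with the vocabulary in the last dimension, while the labels had shape (16,). The helper below is a hypothetical plain-Python sketch of that rule, not FlagAI code, and the middle dimension 7 is an arbitrary stand-in.

```python
def expected_target_shape(input_shape):
    """Shape of class-index targets that cross_entropy expects for an input shape."""
    if len(input_shape) == 1:  # unbatched input of shape (C,)
        return ()
    batch, _num_classes, *extra = input_shape
    return (batch, *extra)

print(expected_target_shape((16, 2)))         # (16,)
print(expected_target_shape((16, 7, 50048)))  # (16, 50048)
```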

OS (please complete the following information):

Version : v1.0.1

Add Dreambooth example for AltDiffusion

Hi, I see that diffusers already supports AltDiffusion. I tried DreamBooth on AltDiffusion using diffusers: it only requires changing the original StableDiffusion Pipeline to the AltDiffusion Pipeline and replacing the text encoder, and the results look great!

Here are some results I generated in Chinese. I use the special token <鸣人> to represent Uzumaki Naruto.

Prompt: 一张<鸣人>男孩的照片，背景是沙漠，masterpieces (A photo of a <鸣人> boy with a desert in the background, masterpieces)

Prompt: 一张<鸣人>男孩的照片，背景是富士山，masterpieces (A photo of a <鸣人> boy with Mount Fuji in the background, masterpieces)

Prompt: 一张<鸣人>男孩的照片，铅笔素描 (A photo of a <鸣人> boy, pencil sketch)

Prompt: 一张<鸣人>男孩的照片，油画梵高风格 (A photo of a <鸣人> boy, oil painting in the style of Van Gogh)

I'm curious whether FlagAI could support DreamBooth. Thanks!

About the logo size of FlagAI in README file

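There is a lot of blank space above and below the FlagAI logo, which makes the top of the README look unbalanced. Could you check whether the blank space above and below the logo can be reduced? Thanks.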

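Question about the AltCLIP ablation study results

Hi, thank you for your work.
In the AltCLIP paper, Table 1 reports the AltCLIP_T results on ImageNet-1k as (74.5, 59.6), but in the ablation study in Table 3 the results become (51.61, 41.66).

Could you explain the discrepancy?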

Initializing AutoLoader raises TypeError: __init__() got multiple values for argument 'text_config'

1. Initialization code:
auto_loader = AutoLoader(
    task_name="txt_img_matching",
    model_dir="./checkpoints",
    model_name="AltCLIP-XLMR-L"  # Load the checkpoints from Modelhub(model.baai.ac.cn/models)
)

2. Error:
│ d:\Users\bigdata\Anaconda3\lib\site-packages\transformers\configuration_utils.py:688 in │
│ from_dict │
│ │
│ 685 │ │ if "_commit_hash" in kwargs and "_commit_hash" in config_dict: │
│ 686 │ │ │ kwargs["_commit_hash"] = config_dict["_commit_hash"] │
│ 687 │ │ │
│ ❱ 688 │ │ config = cls(**config_dict) │
│ 689 │ │ │
│ 690 │ │ if hasattr(config, "pruned_heads"): │
│ 691 │ │ │ config.pruned_heads = dict((int(key), value) for key, value in config.pruned │
│ │
│ C:\Users\bigdata\AppData\Roaming\Python\Python38\site-packages\flagai\model\mm\AltCLIP.py:79 in │
│ __init__ │
│ │
│ 76 │ │ │ │ num_layers=3, │
│ 77 │ │ │ │ variant='invert', │
│ 78 │ │ │ │ **kwargs): │
│ ❱ 79 │ │ super().__init__(text_config_dict, vision_config_dict, projection_dim, │
│ 80 │ │ │ │ │ │ logit_scale_init_value, **kwargs) │
│ 81 │ │ if text_config_dict is None: │
│ 82 │ │ │ text_config_dict = {}
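
One way this error can arise, offered as an assumption rather than a diagnosis: the parent config class in the installed transformers version names its first parameter text_config, the subclass forwards its own first argument positionally, and the loaded config dictionary also contains a text_config key that travels through **kwargs. The sketch below reproduces that pattern with hypothetical stand-in classes (BaseConfig and ChildConfig are not the real classes).

```python
class BaseConfig:
    def __init__(self, text_config=None, vision_config=None, **kwargs):
        self.text_config = text_config or {}
        self.vision_config = vision_config or {}

class ChildConfig(BaseConfig):
    def __init__(self, text_config_dict=None, vision_config_dict=None, **kwargs):
        # text_config_dict fills the parent's first positional slot (text_config),
        # while kwargs may carry a text_config key as well.
        super().__init__(text_config_dict, vision_config_dict, **kwargs)

config_dict = {"text_config": {"hidden_size": 1024}}  # made-up values
message = ""
try:
    ChildConfig(**config_dict)
except TypeError as err:
    message = str(err)
print(message)
```

If this is the cause, comparing the installed transformers version against the one the FlagAI release was developed with would be a reasonable first check.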

cannot import name 'clock_settime' from 'time' (unknown location)
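
The Python documentation lists time.clock_settime as available on Unix only, so an unconditional import of it can fail on other platforms such as Windows. The issue gives no environment details, so this is only a possible explanation. A small sketch of a guarded lookup (the helper name is hypothetical):

```python
import time

def get_clock_settime():
    """Return time.clock_settime if this platform provides it, else None."""
    return getattr(time, "clock_settime", None)

fn = get_clock_settime()
print("available" if fn is not None else "not available on this platform")
```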

glm_title_ch.py raises KeyError: 'position_ids'

本文总结了十个可穿戴产品的设计原则，而这些原则同样也是笔者认为是这个行业最吸引人的地方，1为人们解决重复性问题2从人开始而不是从机器开始3要引起注意但不要刻意4提升用户能力而不是取代人。 :
--------------sample 0 :-------------------
-----------random sample: --------------
{'input_ids': [23694, 35526, 12895, 43392, 32153, 2837, 101, 1369, 43359, 24733, 1369, 1736, 88, 11921, 5789, 15658, 43469, 39550, 247, 4153, 43377, 797, 341, 3075, 30639, 43372, 43576, 43371, 71, 1878, 43576, 1354, 71, 43393, 43385, 817, 295, 30057, 7692, 43413, 852, 439, 169, 1878, 6170, 43371, 43361], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
Traceback (most recent call last):
  File "glm_title_ch.py", line 33, in <module>
    predictor.predict_generate_randomsample(text,
  File "/home/lz/miniconda3/envs/flagai/lib/python3.8/site-packages/flagai/model/predictor/predictor.py", line 288, in predict_generate_randomsample
    return glm_random_sample(self.model, self.tokenizer, text,
  File "/home/lz/miniconda3/envs/flagai/lib/python3.8/site-packages/flagai/model/predictor/utils.py", line 600, in glm_random_sample
    position_ids = torch.tensor([data['position_ids']],
KeyError: 'position_ids'

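Hello, the README mentions that FlagAI supports deploying large models, but I could not find any related examples. Also, does FlagAI support inference acceleration?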

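Following the CLIP finetuning example, the loss is None

Hello, following the example in the README, I finetuned on my own dataset, and after the first iteration every lm_loss I get is None.
I printed data inside the forward_step method and the data looks normal, but I cannot get a correct model_output. Could someone please take a look?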

[TypeError: accuracy_metric() got an unexpected keyword argument 'tokenizer']

Tasks

glm_superglue
tnews

To Reproduce

Traceback (most recent call last):
  File "train_large_clue.py", line 51, in <module>
    trainer.train(model,
  File "/root/anaconda3/envs/py38/lib/python3.8/site-packages/flagai-1.6.1-py3.8.egg/flagai/trainer.py", line 598, in train
    eval_dict = self.evaluate_and_print_results(
  File "/root/anaconda3/envs/py38/lib/python3.8/site-packages/flagai-1.6.1-py3.8.egg/flagai/trainer.py", line 1103, in evaluate_and_print_results
    eval_dict = self.evaluate(forward_step_func=forward_step_func,
  File "/root/anaconda3/envs/py38/lib/python3.8/site-packages/flagai-1.6.1-py3.8.egg/flagai/trainer.py", line 1051, in evaluate
    metrics[i] += eval_method(all_logits, all_labels, meta=meta, tokenizer=self.tokenizer)
TypeError: accuracy_metric() got an unexpected keyword argument 'tokenizer'

The relevant functions:

"""train_large_clue.py""" 
trainer.train(model,
              train_dataset=train_dataset,
              valid_dataset=valid_dataset,
              collate_fn=collate_fn,
              metric_methods=[["acc", accuracy_metric]])


"""flagai.metrics.accuracy_metric.py""" 

def accuracy_metric(predictions, labels, meta=None):
    '''
    predictions: torch.size(n, class_num)
    labels: torch.size(n)
    '''
    count = 0
    assert len(predictions) == len(labels)
    if predictions.size() != labels.size():      
        predictions = torch.argmax(predictions, dim=-1)
        for prediction, label in zip(predictions, labels):
            count += prediction == label
    else:
        prediction, label = predictions[0], labels[0]
        
        if sigmoid(prediction) >= 0.5:
            count += label == 1
        else:
            count += label == 0
    return 100.0 * count / len(labels)
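
The traceback shows the trainer calling each metric with an extra tokenizer keyword that the quoted accuracy_metric signature does not accept. One possible workaround, sketched under that assumption, is to wrap the metric so that unexpected keyword arguments are dropped. The names ignore_extra_kwargs and toy_accuracy are hypothetical; toy_accuracy merely stands in for a metric with the same signature.

```python
def ignore_extra_kwargs(metric):
    """Wrap a metric so unexpected keyword arguments (e.g. tokenizer) are dropped."""
    def wrapped(predictions, labels, meta=None, **_unused):
        return metric(predictions, labels, meta=meta)
    return wrapped

# Stand-in metric with the same (predictions, labels, meta) signature.
def toy_accuracy(predictions, labels, meta=None):
    correct = sum(p == l for p, l in zip(predictions, labels))
    return 100.0 * correct / len(labels)

metric = ignore_extra_kwargs(toy_accuracy)
print(metric([1, 0, 1], [1, 1, 1], meta=None, tokenizer=object()))
```

With such a wrapper one could pass metric_methods=[["acc", ignore_extra_kwargs(accuracy_metric)]]. The issue does not say whether metrics in this library version are meant to accept a tokenizer argument.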

Typo found in the first sentence of readme file

FlagAI (Fast LArge-scale General AI models) is an fast, easy-to-use and extensible toolkit for large-scale models.

T5-large-en/ch T5-3B-en/ch Model please

https://huggingface.co/docs/transformers/model_doc/t5

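Is there code to reproduce the pretrained GLM models from randomly initialized weights and the full dataset?

As the title says.
One overall problem with the project at present is that it lacks code for concrete, effective training.
The examples all use very small amounts of data; unless GLM has very strong few-shot ability,
users cannot use them to validate the training process and the effectiveness of the model on their own data.

The already-trained models, such as GLM-large-ch and the other preloadable models, all perform very well.
It would be great if you could provide the process that takes these models from random initialization and the full dataset to a finished model.

In other words, the project is very good in the out-of-the-box sense, but it offers little to go on for reproducing the models or debugging with the full dataset.

Could you open-source more of the relevant parts? Thanks.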

[BUG]

Hello, I ran into this problem when running the inference code for the classical poetry generation task. How should I handle it?

[Question] How do I run prediction after training on the CMRC downstream task?

Like this?

text = '''
问题:1994年3月,范廷颂担任什么职务?答案:[MASK] 范廷颂枢机(,),圣名保禄·若瑟(),是越南罗马天主教枢机。1963年被任为主教;1990年被擢升为天主教河内总教区宗座署理;1994年被擢升为总主教,同年年底被擢升为枢机;2009年2月离世。范廷颂于1919年6月15日在越南宁平省天主教发艳教区出生;童年时接受良好教育后,被一位越南神父带到河内继续其学业。范廷颂于1940年在河内大修道院完成神学学业。范廷颂于1949年6月6日在河内的主教座堂晋铎;及后被派到圣女小德兰孤儿院服务。1950年代,范廷颂在河内堂区创建移民接待中心以收容到河内避战的难民。1954年,法越战争结束,越南**共和国建都河内,当时很多天主教神职人员逃至越南的南方,但范廷颂仍然留在河内。翌年管理圣若望小修院;惟在1960年因捍卫修院的自由、自治及拒绝政府在修院设政治课的要求而被捕。1963年4月5日,教宗任命范廷颂为天主教北宁教区主教,同年8月15日就任;其牧铭为「我信天主的爱」。由于范廷颂被越南政府软禁差不多30年,因此他无法到所属堂区进行牧灵工作而专注研读等工作。范廷颂除了面对战争、贫困、被当局**天主教会等问题外,也秘密恢复修院、创建女修会团体等。1990年,教宗若望保禄二世在同年6月18日擢升范廷颂为天主教河内总教区宗座署理以填补该教区总主教的空缺。1994年3月23日,范廷颂被教宗若望保禄二世擢升为天主教河内总教区总主教并兼天主教谅山教区宗座署理;同年11月26日,若望保禄二世擢升    
'''
output=predictor.predict_generate_beamsearch(
    text, 
    out_max_length = 30
)
output

Or like this?

text = '''
问题:1994年3月,范廷颂担任什么职务?答案:[MASK] 范廷颂枢机(,),圣名保禄·若瑟(),是越南罗马天主教枢机。1963年被任为主教;1990年被擢升为天主教河内总教区宗座署理;1994年被擢升为总主教,同年年底被擢升为枢机;2009年2月离世。范廷颂于1919年6月15日在越南宁平省天主教发艳教区出生;童年时接受良好教育后,被一位越南神父带到河内继续其学业。范廷颂于1940年在河内大修道院完成神学学业。范廷颂于1949年6月6日在河内的主教座堂晋铎;及后被派到圣女小德兰孤儿院服务。1950年代,范廷颂在河内堂区创建移民接待中心以收容到河内避战的难民。1954年,法越战争结束,越南**共和国建都河内,当时很多天主教神职人员逃至越南的南方,但范廷颂仍然留在河内。翌年管理圣若望小修院;惟在1960年因捍卫修院的自由、自治及拒绝政府在修院设政治课的要求而被捕。1963年4月5日,教宗任命范廷颂为天主教北宁教区主教,同年8月15日就任;其牧铭为「我信天主的爱」。由于范廷颂被越南政府软禁差不多30年,因此他无法到所属堂区进行牧灵工作而专注研读等工作。范廷颂除了面对战争、贫困、被当局**天主教会等问题外,也秘密恢复修院、创建女修会团体等。1990年,教宗若望保禄二世在同年6月18日擢升范廷颂为天主教河内总教区宗座署理以填补该教区总主教的空缺。1994年3月23日,范廷颂被教宗若望保禄二世擢升为天主教河内总教区总主教并兼天主教谅山教区宗座署理;同年11月26日,若望保禄二世擢升    
'''
output=predictor.predict_generate_randomsample(
    text, 
    out_max_length = 30
)
output

Neither output looks right.

The text before [MASK] is the question, and the text after it is the context.

Models saved with the FlagAI framework cannot be loaded

I finetuned the model using the train.py file in the gpt2_titile_generation folder, but the saved result cannot be used in generate.py.

[Question] Low accuracy when training the TNews text classification task with the GLM model

Hello, below is the code I used:

import os
import numpy as np
import torch
from torch.utils.data import Dataset
from flagai.auto_model.auto_loader import AutoLoader
from flagai.trainer import Trainer
from flagai.metrics import accuracy_metric
from flagai.data.dataset import SuperGlueDataset
from flagai.test_utils import CollateArguments
from flagai.data.dataset import ConstructSuperglueStrategy
from flagai.data.dataset.superglue.control import DEFAULT_METRICS, MULTI_TOKEN_TASKS, CH_TASKS


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

task_name = "tnews"
auto_loader = AutoLoader('classification',
                        model_name="GLM-large-ch",
                        model_dir="./checkpoints",
                        load_pretrain_params=True,
                        class_num=15)

cl_args = CollateArguments()
cl_args.cloze_eval = False
cl_args.multi_token = task_name in MULTI_TOKEN_TASKS

model = auto_loader.get_model()
tokenizer = auto_loader.get_tokenizer()

train_dataset = SuperGlueDataset(task_name=task_name,
                                 data_dir='./datasets/',
                                 dataset_type='train',
                                 tokenizer=tokenizer)

collate_fn = ConstructSuperglueStrategy(cl_args,
                                        tokenizer,
                                        task_name=task_name)

valid_dataset = SuperGlueDataset(task_name=task_name,
                                 data_dir='./datasets/',
                                 dataset_type='dev',
                                 tokenizer=tokenizer)

trainer = Trainer(
    env_type="pytorch",
    experiment_name="GLM_cls",
    batch_size=1,
    lr=1e-5,
    weight_decay=1e-5,
    epochs=10,
    log_interval=1,
    eval_interval=10000,
    pytorch_device=device,
    checkpoint_activations=False,
    save_dir="./glm_cls",
    save_interval=10000,
)

trainer.train(model,
              train_dataset=train_dataset,
              valid_dataset=valid_dataset,
              collate_fn=collate_fn,
              metric_methods=[["acc", accuracy_metric]])

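At 20,000 iterations the accuracy reached 61%, and after that it kept dropping, which looks like overfitting. However, the accuracy peaked at only 61%. What could be causing this?

With the original train_large_clue.py code, the loss keeps oscillating without converging, and the accuracy is only 6%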

import torch
from flagai.trainer import Trainer
from flagai.model.glm_model import GLMForSequenceClassification
from flagai.data.tokenizer import Tokenizer

from flagai.metrics import accuracy_metric
from flagai.data.dataset import SuperGlueDataset
from flagai.test_utils import CollateArguments
from flagai.data.dataset.superglue.control import DEFAULT_METRICS, MULTI_TOKEN_TASKS, CH_TASKS
from flagai.data.dataset import ConstructSuperglueStrategy


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

task_name = "tnews"
model_name = 'GLM-large-ch'
cl_args = CollateArguments()
cl_args.cloze_eval = False
cl_args.multi_token = task_name in MULTI_TOKEN_TASKS

tokenizer = Tokenizer.from_pretrained(model_name)

class_num = 15
model = GLMForSequenceClassification.from_pretrain(model_name=model_name, spell_length=2,
                                                   class_num=class_num, tune_prefix_layers=1)

train_dataset = SuperGlueDataset(task_name=task_name,
                                 data_dir='./datasets/',
                                 dataset_type='train',
                                 tokenizer=tokenizer)

collate_fn = ConstructSuperglueStrategy(cl_args,
                                        tokenizer,
                                        task_name=task_name)

valid_dataset = SuperGlueDataset(task_name=task_name,
                                 data_dir='./datasets/',
                                 dataset_type='dev',
                                 tokenizer=tokenizer)

trainer = Trainer(env_type='pytorch',
                  pytorch_device=device,
                  epochs=2,
                  batch_size=1,
                  lr=1e-5,
                  weight_decay=1e-5,
                  eval_interval=10000,
                  checkpoint_activations=False,
                  fp16=True,
                  log_interval=1000,
                  save_interval=10000,
                  save_dir="./glm_large_clue")

trainer.train(model,
              train_dataset=train_dataset,
              valid_dataset=valid_dataset,
              collate_fn=collate_fn,
              metric_methods=[["acc", accuracy_metric]])

[BUG] glm_generate_samples_en.py produces incorrect output

Running the following code:

import torch
from flagai.model.predictor.predictor import Predictor
from flagai.auto_model.auto_loader import AutoLoader
if __name__ == "__main__":
    """Main training program."""
    print('Generate Samples')
    # Random seeds for reproducibility.
    # Model,
    loader = AutoLoader(task_name='lm',
                                model_name='GLM-large-en',
                                only_download_config=False)
    model = loader.get_model()
    tokenizer = loader.get_tokenizer()
    model.cuda(torch.cuda.current_device())

    predictor = Predictor(model, tokenizer)
    # generate samples
    text = [
        'Question: Is drinking beer bad for your health? Answer: [gMASK]',
    ]
    for t in text:
        output = predictor.predict_generate_randomsample(
            t, top_k=50, repetition_penalty=4.0, top_p=1.0)
        print(t, '\n', output)

produces the following output:

******************** lm glm-large-en
Question: Is drinking beer bad for your health? Answer: [gMASK] 
 [CLS] question : is drinking beer bad for your health ? answer : [gMASK] <|startofpiece|> , , 1 <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> 
<|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|> <|endofpiece|>
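
As a display-level workaround only, the generated span can be cut at the first end marker. The sketch below assumes the decoded string uses the <|startofpiece|> and <|endofpiece|> markers exactly as shown in the output above; the function name is hypothetical, and it does not address why generation continues past the end marker or why the answer itself is poor.

```python
START_OF_PIECE = "<|startofpiece|>"
END_OF_PIECE = "<|endofpiece|>"

def extract_generated(text):
    """Keep only the text between the first start marker and the first end marker.

    Returns an empty string when no start marker is present.
    """
    _, _, tail = text.partition(START_OF_PIECE)
    generated, _, _ = tail.partition(END_OF_PIECE)
    return generated.strip()

raw = ("[CLS] question : is drinking beer bad for your health ? answer : [gMASK] "
       "<|startofpiece|> , , 1 <|endofpiece|> <|endofpiece|> <|endofpiece|>")
print(extract_generated(raw))  # , , 1
```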

OPT-66B mp_size

Describe the bug
When splitting the OPT-66B model, an error says the size is not evenly divisible.
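
The issue does not say which quantity failed to divide. Assuming the usual tensor-parallel constraint that the attention heads and the hidden size must both split evenly across mp_size partitions, the candidate sizes can be listed as below. The figures 72 heads and 9216 hidden units are taken from the published OPT-66B configuration and should be checked against the model's config.json; the function name is hypothetical.

```python
def valid_mp_sizes(num_heads, hidden_size, max_size=64):
    """Model-parallel sizes that divide both the head count and the hidden size."""
    return [n for n in range(1, max_size + 1)
            if num_heads % n == 0 and hidden_size % n == 0]

# Assumed OPT-66B figures: 72 attention heads, hidden size 9216.
print(valid_mp_sizes(72, 9216, max_size=16))  # [1, 2, 3, 4, 6, 8, 9, 12]
```

Under these assumptions a power-of-two choice such as 16 would not divide the 72 heads, whereas 8 would.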

[BUG] The glm-10b-ch model cannot run inference correctly

To Reproduce

from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor

if __name__ == '__main__':
    loader = AutoLoader("seq2seq", "glm-10b-ch", model_dir="./checkpoints/")
    model = loader.get_model()
    tokenizer = loader.get_tokenizer()
    predictor = Predictor(model, tokenizer)

    text = "今天天气不错[gMASK]"
    output = predictor.predict_generate_beamsearch(text, out_max_length=5, beam_size=1)
    print(output)

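The output is ?? ?? ??.
Debugging inside, I found that the IDs passed to the tokenizer's decode are all 0.

Environment: Win10 x64, Python 3.10; the FlagAI version is the latest currently on pip. It appears to compute on the CPU by default, and CPU usage is noticeably high.

Using AutoLoader("lm", "glm-10b-ch", model_dir="./checkpoints/") gives the same problem.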

Is there a download link for the models? [Question]

Hello, is there a download link for the models? Downloading through the code is slow. Is there a Baidu Cloud link or a similar place to download the models from? Thanks.

[BUG]

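Hello, for the GLM title generation example in the tutorial, how large was the model used to generate the output? I used the glm_title_ch.py code from quick_start with glm-10b-ch, and the results are not good.

The results are as follows: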

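Not enough GPU memory

I am testing AltDiffusion. The documentation says that 10 GB or more of GPU memory is enough, but it will not run on my 12 GB GPU because it runs out of memory. Why is that?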

[BUG] error running quickstart/title_en.py

I've just installed the package locally and ran the test code quickstart/title_en.py, and got the following error.

Any possible reasons? Thanks! See the details below.

skys-MacBook-Pro:quickstart sky$ python3 title_en.py
******************** title-generation 100013 bert-base-en
Traceback (most recent call last):
  File "title_en.py", line 29, in <module>
    print(predictor.predict_generate_beamsearch(text, out_max_length=50, beam_size=3))
  File "../flagai/model/predictor/predictor.py", line 231, in predict_generate_beamsearch
    return bert_beamsearch(self.model,
  File "../flagai/model/predictor/utils.py", line 676, in bert_beamsearch
    out_puts_ids = bert_beam_search(model,
  File "../flagai/model/predictor/utils.py", line 280, in bert_beam_search
    scores = bert_predict_generate(model, new_input_ids,
  File "../flagai/model/predictor/utils.py", line 235, in bert_predict_generate
    score = model(**{
  File "/Users/sky/Library/Python/3.8/lib/python/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "../flagai/model/bert_model.py", line 359, in forward
    encoder_out, pooler_out = self.model(
  File "/Users/sky/Library/Python/3.8/lib/python/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "../flagai/model/bert_model.py", line 153, in forward
    extended_attention_mask = extended_attention_mask * attention_mask
RuntimeError: The size of tensor a (3) must match the size of tensor b (171) at non-singleton dimension 2

Running title_cn.py gives a similar error:
  File "/Users/sky/Library/Python/3.8/lib/python/site-packages/flagai/model/layers/attentions.py", line 940, in forward
    attention_scores += attention_mask
RuntimeError: output with shape [1, 12, 90, 90] doesn't match the broadcast shape [1, 1, 1, 12, 90, 90]
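
Both messages are broadcasting failures. Shapes are aligned from the trailing dimension, and each pair of sizes must be equal or contain a 1; an in-place operation such as += additionally requires the broadcast result to have the shape of the left-hand tensor. The plain-Python sketch below implements that rule. The shapes in the first call are hypothetical, chosen only to reproduce a 3 versus 171 mismatch at dimension 2, while the second call uses the shapes from the message above.

```python
def broadcast_shape(a, b):
    """Broadcast two shapes by the NumPy/PyTorch rule, or raise ValueError."""
    ndim = max(len(a), len(b))
    result = []
    for i in range(1, ndim + 1):
        da = a[-i] if i <= len(a) else 1  # missing leading dims count as 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"size {da} vs {db} at dimension {ndim - i}")
        result.append(db if da == 1 else da)
    return tuple(reversed(result))

try:
    broadcast_shape((1, 1, 3, 171), (1, 1, 171, 171))  # hypothetical shapes
except ValueError as err:
    print("cannot broadcast:", err)

out_shape = (1, 12, 90, 90)
mask_shape = (1, 1, 1, 12, 90, 90)
combined = broadcast_shape(out_shape, mask_shape)
# The result has six dimensions, so it cannot be written back in place into out_shape.
print(combined, combined == out_shape)
```

In the second case the mask appears to carry two extra leading dimensions, which may point to the mask being expanded twice along the way; that reading is a guess from the shapes alone.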

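Is there a text classification example? [Question]

The existing examples only cover text generation and title generation. Is there a text classification example?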

[BUG] Connection timeout when executing the generate example of AltDiffusion

It seems there is a problem establishing a connection to the Hugging Face proxy to download the safety checker model. Could the safety checker model download URL be changed from Hugging Face to the BAAI ModelHub?
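
Until the download source changes, one possible workaround is to place a copy of the safety checker in a local directory and prefer it over the hub id, since from_pretrained in transformers also accepts a local directory path. The sketch below only resolves which source to use. The local path is hypothetical, and the hub id shown is an assumption, as the issue does not state the value of safety_model_id.

```python
import os

def resolve_model_source(local_dir, hub_id):
    """Prefer a local directory that contains config.json; otherwise return the hub id."""
    if local_dir and os.path.isfile(os.path.join(local_dir, "config.json")):
        return local_dir
    return hub_id

source = resolve_model_source(
    "./checkpoints/safety_checker",             # hypothetical local copy
    "CompVis/stable-diffusion-safety-checker",  # assumed hub id
)
print(source)
```

The returned string would then be passed wherever the example currently passes safety_model_id.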

Below is the error output when running python generate.py:

root@-0:~/FlagAI/examples/AltDiffusion# python generate.py
******************** text2img altdiffusion-m9
Extension horovod.torch has not been built: /usr/local/lib/python3.8/dist-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-38-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Warning! MPI libs are missing, but python applications are still avaiable.
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 64, 64) = 16384 dimensions.
making attention of type 'vanilla' with 512 in_channels
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 358, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f2b1c1d1b80>: Failed to establish a new connection: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='s3-proxy.huggingface.tech', port=443): Max retries exceeded with url: /lfs.huggingface.co/repos/c3/33/c333b2b94c5a8a06ddcbb20b02e728f6bef192870028f8a6859247cabb771a03/64b8393f1afd5a0c1ed2aa5f341fa7c08286839a48f3743162a76a2835c808bd?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGOZQA2IKWK%2F20230104%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230104T011244Z&X-Amz-Expires=259200&X-Amz-Signature=6b4bb6d3d218d24cf8b030d2ee60679e3175ba64c350072718017b7701b01d02&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3D%22pytorch_model.bin%22&x-id=GetObject (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2b1c1d1b80>: Failed to establish a new connection: [Errno 110] Connection timed out'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2007, in from_pretrained
    resolved_archive_file = cached_path(
  File "/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py", line 284, in cached_path
    output_path = get_from_cache(
  File "/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py", line 594, in get_from_cache
    http_get(url_to_download, temp_file, proxies=proxies, resume_size=resume_size, headers=headers)
  File "/usr/local/lib/python3.8/dist-packages/transformers/utils/hub.py", line 432, in http_get
    r = requests.get(url, stream=True, proxies=proxies, headers=headers)
  File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='s3-proxy.huggingface.tech', port=443): Max retries exceeded with url: /lfs.huggingface.co/repos/c3/33/c333b2b94c5a8a06ddcbb20b02e728f6bef192870028f8a6859247cabb771a03/64b8393f1afd5a0c1ed2aa5f341fa7c08286839a48f3743162a76a2835c808bd?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA4N7VTDGOZQA2IKWK%2F20230104%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230104T011244Z&X-Amz-Expires=259200&X-Amz-Signature=6b4bb6d3d218d24cf8b030d2ee60679e3175ba64c350072718017b7701b01d02&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3D%22pytorch_model.bin%22&x-id=GetObject (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f2b1c1d1b80>: Failed to establish a new connection: [Errno 110] Connection timed out'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "generate.py", line 19, in <module>
    predictor.predict_generate_images(
  File "/usr/local/lib/python3.8/dist-packages/flagai/model/predictor/predictor.py", line 342, in predict_generate_images
    safety_checker, safety_feature_extractor = get_safety_checker()
  File "/usr/local/lib/python3.8/dist-packages/flagai/model/predictor/utils.py", line 24, in get_safety_checker
    safety_checker = StableDiffusionSafetyChecker.from_pretrained(safety_model_id)
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2096, in from_pretrained
    raise EnvironmentError(
OSError: Can't load the model for 'CompVis/stable-diffusion-safety-checker'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'CompVis/stable-diffusion-safety-checker' is the correct path to a directory containing a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

BLOOM 1b1 and 3b model support

https://huggingface.co/bigscience/bloom-1b1
https://huggingface.co/bigscience/bloom-3b

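[BUG] After training a GPT2 model and saving it locally, calling load_weights to initialize from the local weights raises an error

As in the title.
Loading works when the transpose_weight call is disabled, as follows: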

def load_weights_without_trans(self, checkpoint_path):
    checkpoint = torch.load(checkpoint_path,
                            map_location=torch.device("cpu"))
    if "module" in checkpoint:
        # ddp
        checkpoint = checkpoint["module"]
    #checkpoint = self.transpose_weight(checkpoint)
    self.load_state_dict(checkpoint, strict=False)
    return checkpoint

Is there a feature for converting models to and from HuggingFace transformers models?

As in the title.

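Can a single V100 handle inference and fine-tuning for glm-10b-ch?

On a single V100 I can run inference with the English glm-10b, but when I run the quickstart task with the model changed to glm-10b-ch, it runs out of memory (OOM).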
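
A rough back-of-the-envelope check may help with questions like this. The sketch below estimates the memory needed only to hold the weights, assuming about 10 billion parameters (inferred from the "10b" in the model name, not measured). It ignores activations, any vocabulary-size difference between the English and Chinese checkpoints, and framework overhead. The function name is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Assumption: roughly 10e9 parameters, taken from the "10b" in the model name.
NUM_PARAMS = 10e9

for label, nbytes in [("fp32", 4), ("fp16", 2)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, nbytes):.1f} GiB")
```

Under that assumption the fp16 weights alone take about 18.6 GiB, so a 16 GB V100 cannot hold them and a 32 GB V100 has limited headroom left for activations. Fine-tuning additionally needs gradients and optimizer state on top of this.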

How can I fine-tune on my own dataset?

[BUG] Change "baai-open" to "FlagAI-Open" in README.md

Since the repo has moved to FlagAI-Open, the README.md file needs to be updated.
The following line still points to the old location:

git clone https://github.com/BAAI-Open/FlagAI.git

[BUG] Missing multilingual information at the beginning of the AltDiffusion README

Congratulations on the new release of AltDiffusion-m9, which supports 9 popular languages. However, when I followed the link to examples/AltDiffusion, I could not find any m9 information in the README until almost the end of the file.

It would be good to add the multilingual support information at the very beginning of the README.

Running train.py in the gpt2_title_generation folder of the FlagAI framework raises an error during the prediction test

sample configs of img2img

Below is my code. The output is not good, and I wonder whether the prompt is suitable. Could you give me some sample configs?

from diffusers import AltDiffusionImg2ImgPipeline, AltDiffusionPipeline
from PIL import Image

if __name__ == '__main__':
    # Load the text-to-image pipeline, then reuse its components for img2img.
    text2img = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9")
    img2img = AltDiffusionImg2ImgPipeline(**text2img.components)
    img2img = img2img.to("cuda")

    img = Image.open('input/高圆圆.jpeg')
    out_img = img2img(
        prompt="((masterpiece)), (((best quality))), ((ultra-detailed)), ((illustration)), girl, genshin impact,vision",
        init_image=img,
        strength=0.7,
        guidance_scale=30,
        negative_prompt='nsfw, longbody, lowres, bad anatomy, bad hands, missing fingers, pubic hair,extra digit, fewer digits, cropped, worst quality, low quality',
    ).images[0]
    out_img.save('output.png')
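
When tuning these settings, it may help to see how `strength` interacts with the step count. The sketch below assumes the img2img pipeline runs roughly `int(num_inference_steps * strength)` denoising steps. This is an assumption about the pipeline's behavior, not something the snippet above confirms. The helper name and the figure of 50 steps are illustrative only.

```python
def effective_denoising_steps(num_inference_steps: int, strength: float) -> int:
    """Steps actually run when starting from a partially noised input image."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return min(int(num_inference_steps * strength), num_inference_steps)


# Assumption: 50 scheduler steps in total; the snippet above does not set this.
for strength in (0.3, 0.5, 0.7, 1.0):
    steps = effective_denoising_steps(50, strength)
    print(f"strength={strength}: {steps} of 50 steps")
```

A lower `strength` keeps more of the input image. Separately, a `guidance_scale` of 30 is well above the 7.5 commonly used as a default for Stable Diffusion style pipelines, so lowering it may be worth trying first.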

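Is there a GLM model fine-tuned on the CMRC downstream task?

1. As in the title: is there a model that has been fine-tuned for reading comprehension?
2. Also, the template the predictor relies on is built inside collate_fn: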

        elif self.task_name in ["cmrc"]:
            mask_id = self.tokenizer.get_command_id('MASK')
            source_text = example.text_a
            target_text = example.meta["answer"].strip()
            question = example.meta["question"].strip()
            source_tokens = self.tokenizer.EncodeAsIds(source_text.rstrip())
            question_tokens = self.tokenizer.EncodeAsIds("问题：" + question +
                                                         "答案：")
            max_src_length = self.args.max_src_length - len(
                question_tokens) - 2
            if max_src_length <= 0:
                question_tokens = question_tokens[self.args.max_src_length //
                                                  4]
            source_tokens = [cls_id] + question_tokens + [
                mask_id
            ] + source_tokens[:max_src_length]

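Would you consider documenting this part?

3. Dataset loading and preprocessing depend on the specific dataset name rather than on a more general task format or dataset format, which raises the cost for users of imitating the dataset input format.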
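
The template in item 2 can be sketched with plain lists of token ids, which may be a useful starting point for documentation. Everything here is illustrative: `build_cmrc_input` and the toy ids are hypothetical. The quoted code indexes `question_tokens` with a single integer when the budget runs out. The sketch assumes that a truncating slice was intended and that the passage budget is then recomputed.

```python
from typing import List


def build_cmrc_input(question_tokens: List[int], source_tokens: List[int],
                     cls_id: int, mask_id: int, max_src_length: int) -> List[int]:
    """Assemble [CLS] + question + [MASK] + truncated passage."""
    # Reserve room for the question plus the two special tokens.
    budget = max_src_length - len(question_tokens) - 2
    if budget <= 0:
        # Assumption: the intent is to truncate the question (a slice),
        # then recompute how much room is left for the passage.
        question_tokens = question_tokens[:max_src_length // 4]
        budget = max_src_length - len(question_tokens) - 2
    return [cls_id] + question_tokens + [mask_id] + source_tokens[:max(budget, 0)]


# Toy ids: 101 = CLS, 103 = MASK (arbitrary values for illustration).
tokens = build_cmrc_input([1, 2, 3], list(range(10, 30)), 101, 103, max_src_length=10)
print(tokens)
```

The result never exceeds `max_src_length` tokens, and the `[MASK]` position always directly follows the question prompt.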

OPT finetuning

Is there any information on resource usage for each OPT model size?

[BUG] superGLUE example bug

Describe the bug

A KeyError is raised by the superGLUE example:

task_name = 'qqp'
trainer = Trainer(env_type='pytorch',
                  pytorch_device="cuda",
                  epochs=2,
                  batch_size=1,
                  eval_interval=1000,
                  checkpoint_activations=False,
                  fp16=True,
                  log_interval=1,
                  save_dir="./glm_superglue_en",
                  # master_ip='127.0.0.1',
                  # master_port=17755,
                  # num_nodes=1,
                  # num_gpus=2,
                  # hostfile='./hostfile',
                  model_parallel_size=2,
                  deepspeed_config='./deepspeed.json',
                  training_script=__file__)

model = GLMForSingleTokenCloze.from_pretrain(download_path="/mnt/test_10b_models",
                                             model_name="GLM-large-en")

tokenizer = GLM10bENBPETokenizer()

train_dataset = SuperGlueDataset(task_name=task_name,
                                 data_dir='./datasets/',
                                 dataset_type='train',
                                 tokenizer=tokenizer,
                                 cloze_eval=True)
valid_dataset = SuperGlueDataset(task_name=task_name,
                                 data_dir='./datasets/',
                                 dataset_type='dev',
                                 tokenizer=tokenizer,
                                 cloze_eval=True)

cl_args = CollateArguments()
cl_args.cloze_eval = True

if task_name in ['copa', 'wsc', 'record']:
    cl_args.multi_token = True

from flagai.data.dataset import ConstructSuperglueStrategy

collate_fn = ConstructSuperglueStrategy(cl_args,
                                        tokenizer,
                                        task_name=task_name)
trainer.train(model,
              train_dataset=train_dataset,
              valid_dataset=valid_dataset,
              collate_fn=collate_fn,
              metric_methods=[["acc", accuracy_metric]])

Tasks

An officially supported task in the examples folder (such as GLUE/Title-generation, ...)
My own task or dataset

To Reproduce

Creating qqp dataset from file at ./datasets/ (split=train)
Returning 363846 train examples with label dist.: [('0', 229468), ('1', 134378)]
Creating qqp dataset from file at ./datasets/ (split=dev)
Returning 40430 dev examples with label dist.: [('0', 25545), ('1', 14885)]
Optimizer = Adam
[2022-06-08 17:54:06,911] [INFO] [logger.py:70:log_dist] [Rank -1] loading checkpoints form checkpoints/99
[2022-06-08 17:54:06,912] [INFO] [logger.py:70:log_dist] [Rank -1] WARNING: could not find the metadata file checkpoints/99/latest_checkpointed_iteration.txt
[2022-06-08 17:54:06,912] [INFO] [logger.py:70:log_dist] [Rank -1]     will not load any checkpoints and will start from random
[2022-06-08 17:54:06,912] [INFO] [logger.py:70:log_dist] [Rank -1] working on epoch 0 ...
Traceback (most recent call last):
  File "train_10b_superglue.py", line 59, in <module>
    trainer.train(model,
  File "/opt/conda/lib/python3.8/site-packages/flagai-1.0.1-py3.8.egg/flagai/trainer.py", line 448, in train
    for iteration_, batch in enumerate(train_dataloader):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 570, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/opt/conda/lib/python3.8/site-packages/flagai-1.0.1-py3.8.egg/flagai/data/dataset/data_collator/collate_fn.py", line 105, in __call__
    sample = self.pvp.encode(example, {})
  File "/opt/conda/lib/python3.8/site-packages/flagai-1.0.1-py3.8.egg/flagai/data/dataset/superglue/pvp.py", line 195, in encode
    raw_parts_a, raw_parts_b = self.get_parts(example)
  File "/opt/conda/lib/python3.8/site-packages/flagai-1.0.1-py3.8.egg/flagai/data/dataset/superglue/pvp.py", line 1493, in get_parts
    return [text_a], [" Do you mean ", text_b, [self.mask], "."]
  File "/opt/conda/lib/python3.8/site-packages/flagai-1.0.1-py3.8.egg/flagai/data/dataset/superglue/pvp.py", line 99, in mask
    return self.tokenizer.get_command('MASK').Id
  File "/opt/conda/lib/python3.8/site-packages/flagai-1.0.1-py3.8.egg/flagai/data/tokenizer/tokenizer.py", line 172, in get_command
    return self.command_name_map[name]
KeyError: 'MASK'
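
The traceback ends in a bare dictionary lookup, so the tokenizer that was passed in apparently has no 'MASK' entry in its command_name_map. The sketch below shows how such a lookup could report which command names are available, which makes a model/tokenizer mismatch easier to spot. `CommandTable` is a hypothetical stand-in, not FlagAI's tokenizer class, and the ids are made up.

```python
class CommandTable:
    """Toy stand-in for a tokenizer's command-token table (hypothetical)."""

    def __init__(self, command_name_map):
        self.command_name_map = dict(command_name_map)

    def get_command(self, name):
        try:
            return self.command_name_map[name]
        except KeyError:
            # List the registered names so the caller can see what is missing.
            known = ", ".join(sorted(self.command_name_map)) or "<none>"
            raise KeyError(
                f"command token {name!r} is not defined; available: {known}"
            ) from None


table = CommandTable({"pad": 0, "eos": 1, "cls": 2})
try:
    table.get_command("MASK")
except KeyError as err:
    print(err)
```

Checking the available names before training starts surfaces this kind of problem earlier than the first call to the collate function.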

OS (please complete the following information):

Version: v1.0.1

Any plan for Swin Transformer?

Is there any plan for Swin Transformer?

altdiffusion-m9 is not be supported

Hi, running the following program produces the error "The model_name: altdiffusion-m9 is not be supported":

import torch
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor

# Initialize 
prompt = "Anime portrait of natalie portman as an anime girl by stanley artgerm lau, wlop, rossdraws, james jean, andrei riabovitchev, marc simonetti, and sakimichan, trending on artstation"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


loader = AutoLoader(task_name="text2img", #contrastive learning
                    model_name="AltDiffusion-m9",
                    model_dir="./checkpoints")

model = loader.get_model()
model.eval()
model.to(device)
predictor = Predictor(model)
predictor.predict_generate_images(prompt)

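Are two 3090s enough to fine-tune `GLM-10b-ch`?

When fine-tuning GLM-10b-ch on two 3090s, I always run out of GPU memory at the validation stage, but not during training. I would like to know whether two 3090s are simply not enough to fine-tune GLM-10b-ch, or whether my parameter settings are wrong.
Below are the parameters I used for training:
Trainer: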

trainer = Trainer(
    env_type="deepspeed+mpu",
    epochs=10,
    experiment_name="GLM-10b-ch-seq2seq",
    eval_interval=2000,
    log_interval=100,
    load_dir=None,
    # parallel settings
    master_ip='127.0.0.1',
    master_port=17750,
    num_nodes=1,
    num_gpus=2,
    hostfile='hostfile',
    training_script=__file__,
    # deepspeed
    deepspeed_config='./config/deepspeed.json',
    # megatron-lm
    model_parallel_size=2,
    save_dir="checkpoints_glm_title_generation",
    save_interval=1,
    num_checkpoints=3,
)

deepspeed.json:

{
    "train_micro_batch_size_per_gpu": 16,
    "eval_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 2,
    "steps_per_print": 100,
    "gradient_clipping": 1.0,
    "zero_optimization": {
      "stage": 3,
      "contiguous_gradients": false,
      "overlap_comm": true,
      "reduce_scatter": true,
      "reduce_bucket_size": 5e7,
      "allgather_bucket_size": 5e7,
      "cpu_offload": true 
    },
    "zero_allow_untested_optimizer": true,
    "fp16": {
      "enabled": true,
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "hysteresis": 2,
      "min_loss_scale": 1
    },
    "optimizer": {
      "type": "Adam",
      "params": {
        "lr": 0.000005,
        "weight_decay": 0.01,
        "betas": [
          0.9,
          0.98
        ],
        "eps": 1e-6
      }
    },
    "activation_checkpointing": {
      "partition_activations": false,
      "contiguous_memory_optimization": false
    },
    "wall_clock_breakdown": false
  }
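
To make the batch settings above concrete, the sketch below computes the global batch size from the DeepSpeed fields and the Trainer's parallel settings. It assumes that GPUs in the same model-parallel group together form one data-parallel replica, so `data_parallel_size = num_gpus // model_parallel_size`. That mapping is an assumption about how the Trainer uses these arguments, and `global_batch_size` is a hypothetical helper.

```python
import json

# Subset of the deepspeed.json above; only the fields used here.
config = json.loads("""
{
  "train_micro_batch_size_per_gpu": 16,
  "gradient_accumulation_steps": 2
}
""")


def global_batch_size(ds_config: dict, num_gpus: int, model_parallel_size: int) -> int:
    """Samples consumed per optimizer step across all data-parallel replicas."""
    if num_gpus % model_parallel_size != 0:
        raise ValueError("num_gpus must be divisible by model_parallel_size")
    data_parallel_size = num_gpus // model_parallel_size
    return (ds_config["train_micro_batch_size_per_gpu"]
            * ds_config["gradient_accumulation_steps"]
            * data_parallel_size)


print(global_batch_size(config, num_gpus=2, model_parallel_size=2))
```

Under these assumptions the two GPUs form a single replica with a global batch of 32. Lowering `train_micro_batch_size_per_gpu` while raising `gradient_accumulation_steps` keeps that product unchanged and reduces peak activation memory during training.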

[BUG] GPT2Config has no dict attribute: when the Trainer modifies its attributes it cannot use the dict interface, and it cannot be JSON-serialized either

Code in Trainer

        if hasattr(tmp_model,
                   'config') and 'checkpoint_activations' in tmp_model.config:
            tmp_model.config[
                'checkpoint_activations'] = tmp_checkpoint_activations

Code in utils.py

        if hasattr(model, 'save_config'):
            model.save_config(config_path)
            log_dist('  successfully saved {}'.format(config_path))
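
One way to make both code paths tolerate either kind of config is to go through small helpers instead of using the dict interface directly. This is a minimal sketch under the assumption that a config is either a mapping or a plain object with instance attributes. `set_config_value`, `config_to_dict`, and `ToyConfig` are hypothetical names, not part of FlagAI.

```python
import json
from collections.abc import Mapping, MutableMapping


def set_config_value(config, key, value):
    """Set `key` on a config that may be a dict or a plain object."""
    if isinstance(config, MutableMapping):
        config[key] = value
    else:
        setattr(config, key, value)


def config_to_dict(config):
    """Return a plain dict copy of a dict-like or object-style config."""
    if isinstance(config, Mapping):
        return dict(config)
    return dict(vars(config))


class ToyConfig:  # hypothetical stand-in for an object-style config
    def __init__(self):
        self.n_embd = 768
        self.checkpoint_activations = False


for cfg in ({"n_embd": 768, "checkpoint_activations": False}, ToyConfig()):
    set_config_value(cfg, "checkpoint_activations", True)
    print(json.dumps(config_to_dict(cfg), sort_keys=True))
```

Both configs serialize to the same JSON here, so the Trainer and the save routine would not need to know which style a given model uses.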

[BUG] Errors in MLM training of Bert

Describe the bug
An error occurs when I try to fine-tune a BERT model on the masked language modeling task.

Tasks

An officially supported task in the examples folder (such as GLUE/Title-generation, ...)
My own task or dataset

To Reproduce
https://github.com/marscrazy/Tab2NL/blob/train_with_flagai/train_our_flagai.py

import os
import argparse
from data import get_dataset
from sklearn.metrics import roc_auc_score
import numpy as np
import random
import time
import torch
from flagai.trainer import Trainer
from flagai.auto_model.auto_loader import AutoLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorForLanguageModeling, TrainingArguments

def set_seed(SEED):
    torch.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)
    np.random.seed(SEED)
    random.seed(SEED)
    #torch.backends.cudnn.deterministic = True
set_seed(26)

def compute_metrics(predictions, labels, meta=None):
    predictions = predictions[:,1]
    return {'roc_auc':roc_auc_score(labels,predictions)}

class txtDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

def finetuning_model(
        train_x, train_y, val_x, val_y, cv_fold=1, dataset_id=11,
        model_dir = "bert-base-ch", #bert-base-uncased
        is_mlm = False,
        num_train_epochs=10, #10
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=32,  # batch size for evaluation
        warmup_steps=200,  # number of warmup steps for learning rate scheduler
        weight_decay=0.1,  # strength of weight decay
        logging_steps=100,#20
        seed=11,
        learning_rate=4e-5,
        metric_for_best_model=None,
        config = None,
        tokenizer = None,
        model = None,
        output_dir = None,
        logging_dir = None,
        return_model = False
):
    if output_dir is None:
        output_dir = './results/'+str(dataset_id)+'-cv-'+str(cv_fold)+'-mlm'
    if logging_dir is None:
        logging_dir = './logs/'+str(dataset_id)+'-cv-'+str(cv_fold)+'-mlm'
    #if config is None:
        # config = AutoConfig.from_pretrained(model_dir)
    #    import json
    #    config = json.load(open('./checkpoints/BERT-base-en/config.json'))

    if model is None:
        if is_mlm:
            auto_loader = AutoLoader(
            "masklm",
            model_name="BERT-base-en",
            model_dir='./checkpoints',
            )
        else:
            auto_loader = AutoLoader(
            "classification",
            model_name="BERT-base-en",
            model_dir='./checkpoints',
            class_num = 2
            )
        model = auto_loader.get_model()
        tokenizer = AutoTokenizer.from_pretrained("./checkpoints/BERT-base-en")
    train_encodings = tokenizer(train_x.tolist(), truncation=True, padding=True)
    val_encodings = tokenizer(val_x.tolist(), truncation=True, padding=True)
    train_dataset = txtDataset(train_encodings, train_y.astype(np.longlong))
    val_dataset = txtDataset(val_encodings, val_y.astype(np.longlong))
    if is_mlm:
        data_collator = DataCollatorForLanguageModeling(
            tokenizer=tokenizer,
            mlm_probability=0.15
        )
    class MyTrainer(Trainer):
        def forward_step(self, data, model, mems):
            model_output = model(**{'input_ids':data['input_ids'],
                                  'segment_ids':data['token_type_ids'],
                                  'attention_mask':data['attention_mask']
                                 })
            print(model_output)
    trainer = MyTrainer(
        env_type='pytorch',
        epochs=num_train_epochs,
        weight_decay=weight_decay,
        log_interval=logging_steps,
        seed=seed,
        lr=learning_rate,
        save_dir=output_dir,
        tensorboard_dir=logging_dir
    )
    trainer.train(model=model,  # the instantiated 🤗 Transformers model to be trained
        train_dataset=train_dataset,  # training dataset
        valid_dataset=val_dataset,  # evaluation dataset
        metric_methods=[compute_metrics] if not is_mlm else [],
        collate_fn=data_collator if is_mlm else None)

    dir_name = os.listdir(output_dir)[0]
    cur_model_dir = os.path.join(output_dir,dir_name)
    del model
    torch.cuda.empty_cache()
    time.sleep(5)
    if return_model:
        return cur_model_dir, tokenizer, config
   
def train_ptm_cls(train_x,train_y,val_x, val_y, test_x, test_y, cv_fold=1, dataset_id=11,tokenizer=None, config=None,
                  model_dir = "../contrastive/resources/bert-base-uncased"):

    train_encodings = tokenizer(train_x.tolist(), truncation=True, padding=True)
    val_encodings = tokenizer(val_x.tolist(), truncation=True, padding=True)
    test_encodings = tokenizer(test_x.tolist(), truncation=True, padding=True)

    train_dataset = txtDataset(train_encodings, train_y.astype(np.longlong))
    test_dataset = txtDataset(test_encodings, test_y.astype(np.longlong))
    val_dataset = txtDataset(val_encodings, val_y.astype(np.longlong))
    model = AutoModelForSequenceClassification.from_pretrained(model_dir , config=config, from_tf=False,num_labels=2)
    output_dir = './results/'+str(dataset_id)+'-cv-'+str(cv_fold)+'-cls'
    log_dir = './logs/'+str(dataset_id)+'-cv-'+str(cv_fold)+'-cls'
    training_args = TrainingArguments(
        output_dir=output_dir,  # output directory
        num_train_epochs=10,  # total number of training epochs 
        per_device_train_batch_size=32,  # batch size per device during training
        per_device_eval_batch_size=32,  # batch size for evaluation
        warmup_steps=1000,  # number of warmup steps for learning rate scheduler
        weight_decay=0.1,  # strength of weight decay
        logging_dir=log_dir,  # directory for storing logs
        logging_steps=10, 
        eval_steps=10,
        save_steps=10,
        save_total_limit=1,
        do_eval=True,
        evaluation_strategy='steps',
        learning_rate=2e-5,
        seed=11,
        #save_strategy='steps',
        load_best_model_at_end=True,
        metric_for_best_model="roc_auc"
    )
    trainer = Trainer(
        model=model,  # the instantiated 🤗 Transformers model to be trained
        args=training_args,  # training arguments, defined above
        train_dataset=train_dataset,  # training dataset
        eval_dataset=test_dataset,  # evaluation dataset
        compute_metrics=compute_metrics
        #optimizers=(optimizer,None)
    )
    trainer.train()
    train_rs = trainer.evaluate(train_dataset)
    test_rs = trainer.evaluate(test_dataset)
    val_rs = trainer.evaluate(val_dataset)
    return train_rs['eval_roc_auc'], val_rs['eval_roc_auc'],test_rs['eval_roc_auc']


def train(dataset_id=1):
    ds = get_dataset(dataset_id=dataset_id)
    rs = []
    for i, (train_x, val_x, test_x, train_y, val_y, test_y) in enumerate(ds.generate_datasets(to_txt=True)):
        model_dir,tokenizer, config = finetuning_model(train_x,train_y,val_x, val_y,cv_fold=i, dataset_id=dataset_id,
                  model_dir = "../contrastive/resources/bert-base-uncased",is_mlm=True)
        train_auc, val_auc, test_auc = finetuning_model(
            train_x,train_y,val_x, val_y, test_x, test_y, cv_fold=i,dataset_id= dataset_id,tokenizer=tokenizer, config= config,
                  model_dir = model_dir,is_mlm=False)
        rs.append((train_auc,val_auc,test_auc))
        print("Train auc {:.3f}, val auc {:.3f}, Test auc {:.3f}".format(train_auc, val_auc, test_auc))

    for x,y,z in rs:
        print("Train auc {:.3f}, Val auc {:.3f}, Test auc {:.3f}".format(x,y,z))
    print("avg auc is {:.3f}\t{:.3f}".format(np.mean([x[-1] for x in rs]), np.std([x[-1] for x in rs])))
    #train_xgb(ds)

if __name__=="__main__":
    parser = argparse.ArgumentParser(description='Train Classifier with mixup', formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    # Data
    parser.add_argument('--model_dir', type=str, default='H:\\contrast\\SimCSE-main\\SimCSE-main\\bert-base-uncased',help='the path to pretrained models')
    parser.add_argument('--dataset_id', type=str, default='11',choices=['1','2','3','4','5','6','7','8','9','10','11'], help='Choose between 1-11.')
    # MLM pretrain
    parser.add_argument('--mlm_warmup_steps', default=1000, type=int, metavar='N', help='warmup steps (default: 1000)')
    parser.add_argument('--mlm_learning_rate', type=float, default=2e-5)
    parser.add_argument('--mlm_decay', type=float, default=0.1, help='weight decay (L2 penalty)')
    parser.add_argument('--mlm_epochs', type=int, default=300, help='number of epochs to train')
    parser.add_argument('--mlm_train_batch_size', type=int, default=32)
    parser.add_argument('--mlm_eval_batch_size', type=int, default=32)
    parser.add_argument('--mlm_logging_steps', default=10, type=int, metavar='N', help='logging frequency (default: 10)')
    # text classification
    parser.add_argument('--cls_epochs', type=int, default=300, help='number of epochs to train')
    parser.add_argument('--cls_train_batch_size', type=int, default=32)
    parser.add_argument('--cls_eval_batch_size', type=int, default=32)
    parser.add_argument('--cls_warmup_steps', default=1000, type=int, metavar='N', help='warmup steps (default: 1000)')
    parser.add_argument('--cls_decay', type=float, default=0.1, help='weight decay (L2 penalty)')
    parser.add_argument('--cls_logging_steps', default=10, type=int, metavar='N', help='logging frequency (default: 10)')
    parser.add_argument('--cls_learning_rate', type=float, default=2e-5)
    # Optimization options
    #parser.add_argument('--train', type=str, default='vanilla', choices=['vanilla', 'mixup', 'mixup_hidden', 'SRRS'], help='mixup layer')
    # training
    #parser.add_argument('--momentum', type=float, default=0.9)
    #parser.add_argument('--schedule', type=int, nargs='+', default=[150, 225], help='decrease learning rate at these epochs')
    #parser.add_argument('--gammas', type=float, nargs='+', default=[0.1, 0.1], help='LR is multiplied by gamma on schedule, number of gammas should be equal to schedule')

    # Checkpoints
    parser.add_argument('--resume', default='', type=str, metavar='PATH', help='path to latest checkpoint (default: none)')
    parser.add_argument('--start_epoch', default=0, type=int, metavar='N', help='manual epoch number (useful on restarts)')
    # random seed
    parser.add_argument('--seed', default=0, type=int, help='manual seed')
    parser.add_argument('--add_name', type=str, default='')
    parser.add_argument('--job_id', type=str, default='')
    args = parser.parse_args()
    ds = get_dataset(dataset_id=int(args.dataset_id))
    rs = []
    for i, (train_x, val_x, test_x, train_y, val_y, test_y) in enumerate(ds.generate_datasets(to_txt=True,with_title=True if args.dataset_id not in ['1','3'] else False)):
        model_dir,tokenizer, config = finetuning_model(train_x, train_y, val_x, val_y,cv_fold=i, dataset_id=args.dataset_id,
        model_dir = "hkunlp/T5_large_prefix_all_tasks_2upsample2",#bert-base-uncased,hkunlp/from_all_T5_large_prefix_sql2text2
        is_mlm = True,
        num_train_epochs=10,  #args.mlm_epochs,10
        per_device_train_batch_size=args.mlm_train_batch_size,  # batch size per device during training
        per_device_eval_batch_size=args.mlm_eval_batch_size,  # batch size for evaluation
        warmup_steps=args.mlm_warmup_steps,  # number of warmup steps for learning rate scheduler
        weight_decay=args.mlm_decay,  # strength of weight decay
        logging_steps=100,#20
        seed=11,
        learning_rate=4e-5,
        metric_for_best_model=None,
        config = None,
        tokenizer = None,
        model = None,
        output_dir = None,
        logging_dir = None,
        return_model = False)
        
        model_dir,tokenizer,config, trainer= finetuning_model(
            train_x, train_y, val_x, val_y, cv_fold=i,dataset_id= args.dataset_id,tokenizer=tokenizer, config= config,
                  model_dir = model_dir,is_mlm=False, return_model=True)
        test_encodings = tokenizer(test_x.tolist(), truncation=True, padding=True)
        test_dataset = txtDataset(test_encodings, test_y.astype(np.longlong))
        test_auc = trainer.evaluate(test_dataset)['eval_roc_auc']
        rs.append(test_auc)
        print("Test auc {:.3f}".format(test_auc))
    print("avg auc is {:.3f}\t{:.3f}".format(np.mean(rs),np.std(rs)))

Expected behavior
fine-tuning BERT on MLM and classification tasks

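[BUG] This project does not distinguish well between GPU and CPU; it seems it cannot be used on a CPU

The code does not seem to distinguish between devices.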

run error

ubuntu

python3.9 run

loader = AutoLoader(task_name="lm", model_name="opt-1.3b-en")

The following error occurred

self.wte = nn.Embedding(config.vocab_size, config.n_embd)
AttributeError: 'dict' object has no attribute 'vocab_size'

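At 20,000 training iterations the accuracy reached 61%, and after that it kept dropping. It looks like overfitting, but the accuracy peaked at only 61%. What could be causing this?

With the original train_large_clue.py code, the loss keeps oscillating without converging, and the accuracy is only 6%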
