
chinesebert's People

Contributors

littlesulley, xiaoya-li, zijunsun


chinesebert's Issues

About CPU

When running the LCQMC example under tasks, how should things be adjusted if there is no GPU?

python LCQMC_trainer.py --bert_path ../ChineseBERT-base/ --data_dir E:/PycharmProjects/ChineseBert/lcqmc/ --save_path ../../cus_output/ --max_epoch=7 --lr=2e-5 --batch_size=16 --gpus=0

Can the last argument be omitted? Which arguments and code changes are needed?
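
Not an official answer, but for reference: in pytorch-lightning (the 1.x-era API these scripts appear to use), passing gpus=0 or gpus=None to the Trainer selects CPU-only training. A minimal sketch, assuming the script builds its Trainer from the parsed arguments:

    import pytorch_lightning as pl

    # gpus=None (or the integer 0) tells Lightning to train on CPU only;
    # any DDP/distributed backend setting should also be dropped for CPU runs
    trainer = pl.Trainer(gpus=None, max_epochs=7)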

Error in the textClassifier task

Hello, I have a question about the textClassifier task: there are no errors on the thucnews data, but many errors on my own data (all code is identical to yours). An answer I found online says the label_map should start from 0, e.g. {"体育": 0, "娱乐": 1, "家居": 2}, but the same error persists after following that advice. Can you help? Thank you!
[screenshot of the error]
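
For reference, a minimal sketch of building a contiguous, zero-based label map from the labels present in the data (variable names are illustrative, not from the repo):

    # derive label ids from the labels actually present in the training file
    labels = ["体育", "娱乐", "家居"]  # illustrative
    label_map = {label: i for i, label in enumerate(sorted(set(labels)))}
    # -> {"体育": 0, "娱乐": 1, "家居": 2} (sorted by Unicode code point)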

How to continue pre-training on my own data

Hello! Do you have the code used to pre-train the model? I tried further pre-training with run_mlm.py [https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling/run_mlm.py], but the tokenizer it calls differs from the one in your model (BertMaskDataset), and swapping it in caused many problems. I'd appreciate your help, thanks!

Training details

What hardware did you train on, and how long did training take?

NER

Why does the loss never go down when a CRF layer is added on top?

Hi, I've run into a GPU problem

Some weights of the model checkpoint at E:\CODE\pythonProject\ChineseBert-main\CHINESEBERT_PATH were not used when initializing GlyceBertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']

- This IS expected if you are initializing GlyceBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing GlyceBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GlyceBertForSequenceClassification were not initialized from the model checkpoint at E:\CODE\pythonProject\ChineseBert-main\CHINESEBERT_PATH and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
D:\Anaconda\envs\py37torch\lib\site-packages\pytorch_lightning\utilities\distributed.py:37: UserWarning: WORLD_SIZE environment variable (2) is not equal to the computed world size (1). Ignored.
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1

Could someone tell me what's going on here? I haven't found a solution anywhere.
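
One plausible reading of the log (not confirmed by the authors): the UserWarning means a stale WORLD_SIZE=2 environment variable is set in the shell while Lightning computed a world size of 1 for this run; the variable is explicitly ignored, and the final line shows single-process DDP initializing normally, so this is likely a harmless warning rather than an error.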

Cannot Run your scripts

Hello Authors

Which version of transformers are you using? I was not able to execute your code. Also, if there are any other dependencies, please let us know.

About the segment embedding

The paper says the segment embedding was dropped because pre-training does not use the NSP task, so why does token_type_embeddings still appear in fusion_embedding.py in the source, even though it is initialized with zeros?

    self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

    if token_type_ids is None:
        token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

    token_type_embeddings = self.token_type_embeddings(token_type_ids)
    embeddings = inputs_embeds + position_embeddings + token_type_embeddings

About the F1 score on the MSRA dataset

Hello. Many papers report an F1 of roughly 95-96 on the MSRA dataset, yet in your paper even plain BERT reaches 99. Did you pre-process the dataset, or does the dataset used in the paper differ from the public one?
If it was pre-processed, would it be possible to release that version of the dataset?

About obtaining the glyph vector of each character

I want to compute the similarity between the glyphs of two characters, which requires their glyph vectors.
[screenshot]
Which file on Hugging Face computes the glyphs, and can the glyph vectors be obtained directly from model/glyph_embedding.py?
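
A minimal sketch of one way to compare glyphs directly from the released font files (the file path and the (23236, 24, 24) layout are assumptions based on the .npy files shipped with ChineseBERT-base, not an official API):

    import numpy as np

    # assumed: one 24x24 bitmap per vocabulary entry (reshape in case it is stored flat)
    glyphs = np.load("ChineseBERT-base/config/STFANGSO.TTF24.npy").astype(np.float32)
    glyphs = glyphs.reshape(-1, 24, 24)

    def glyph_cosine(i: int, j: int) -> float:
        """Cosine similarity between the glyph bitmaps of token ids i and j."""
        a, b = glyphs[i].ravel(), glyphs[j].ravel()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))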

About the tokenizer

I've been reading your paper and code, and the tokenizer puzzles me. A regular tokenizer produces only input_ids, while yours produces both input_ids and pinyin_ids, which seems quite magical. Could you open-source the code used to train the tokenizer?
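
Not the repo's actual code, but a minimal sketch of how pinyin_ids can be produced alongside input_ids using the pypinyin package (the letter-to-id table and the fixed length of 8 are assumptions modeled on the paper's description):

    from pypinyin import pinyin, Style

    PAD = 0
    # illustrative alphabet: letters a-z then tone digits, each mapped to a small id
    CHAR2ID = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz012345")}

    def to_pinyin_ids(text: str, length: int = 8) -> list:
        """One fixed-length id sequence per character, e.g. '猫' -> 'mao1' -> ids."""
        out = []
        for item in pinyin(text, style=Style.TONE3):  # tone appended as a digit
            syllable = item[0][:length]
            ids = [CHAR2ID.get(c, PAD) for c in syllable]
            out.append(ids + [PAD] * (length - len(ids)))
        return out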

A question

Hi, I remember the paper describes a whole-word-masking strategy. Why can't I find it in the code?
Thanks!

What should I do about this recurring error? I don't know how to fix it

(yl) D:\ChineseBert-main\tasks\THUCNew>python THUCNews_trainer.py --bert_path ./111/ --data_dir ./cnews/ --save_path ./222/ --max_epoch=5 --lr=2e-5 --batch_size=8 --gpus=0
Some weights of the model checkpoint at ./111/ were not used when initializing GlyceBertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']

- This IS expected if you are initializing GlyceBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing GlyceBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GlyceBertForSequenceClassification were not initialized from the model checkpoint at ./111/ and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "THUCNews_trainer.py", line 229, in <module>
    main()
  File "THUCNews_trainer.py", line 193, in main
    model = ChnSentiClassificationTask(args)
  File "THUCNews_trainer.py", line 51, in __init__
    self.model = GlyceBertForSequenceClassification.from_pretrained(self.bert_dir)
  File "d:\ProgramData\Anaconda3\envs\yl\lib\site-packages\transformers\modeling_utils.py", line 1071, in from_pretrained
    model.__class__.__name__, "\n\t".join(error_msgs)
RuntimeError: Error(s) in loading state_dict for GlyceBertForSequenceClassification:
size mismatch for bert.embeddings.glyph_embeddings.embedding.weight: copying a param with shape torch.Size([23236, 1728]) from checkpoint, the shape in current model is torch.Size([23236, 1152]).
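
One plausible reading of the size mismatch (arithmetic only, not an official diagnosis): the glyph embedding stores flattened 24×24 bitmaps per font, and 1728 = 24 × 24 × 3 while 1152 = 24 × 24 × 2, so the checkpoint appears to have been built with three font files while the config directory being loaded resolves only two; checking that all font .npy files are present under ./111/ would be the first thing to try.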

Error loading font files in ChnSetiCorp

When running ChnSetiCorp.py, I get the error: Failed to interpret file 'ChineseBERT-base\config\._STFANGSO.TTF24.npy' as a pickle
How can this be resolved? I couldn't find an answer online.
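
A likely cause, judging from the '._' prefix (an assumption, not confirmed by the authors): files named ._*.npy are macOS AppleDouble metadata created when the archive was packed on a Mac, not real NumPy files, so np.load cannot parse them. Deleting them, or skipping them when globbing, usually resolves this. A minimal sketch:

    import glob
    import os

    # keep only real .npy files, skipping macOS '._' metadata companions
    font_npy_files = [
        f
        for f in glob.glob(os.path.join("ChineseBERT-base", "config", "*.npy"))
        if not os.path.basename(f).startswith("._")
    ]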

The parser defines a dropout argument, but the rest of the code never uses it

    parser.add_argument("--hidden_dropout_prob", default=0.1, type=float, help="dropout probability")

hidden_dropout_prob is defined here, yet I cannot find anywhere in the code that actually uses this command-line argument. For example:

    self.bert = GlyceBertModel(config)
    self.dropout = nn.Dropout(config.hidden_dropout_prob)
    self.classifier = nn.Linear(config.hidden_size, config.num_labels)

Line 253 of modeling_glycebert.py uses hidden_dropout_prob directly (from the config, not from the parser).

Running the model

Can the model be run directly inside PyCharm?

How to run testing directly

I have trained LCQMC_trainer and saved a ckpt file. How do I use the ckpt file to test?
Loading the saved checkpoint with model = BQTask.load_from_checkpoint("F:/PycharmProjects/ChineseBert/output/bq/checkpoint/epoch=3-val_loss=0.5043-val_acc=0.8612.ckpt") fails because the constructor arguments cannot be found.
[screenshot]
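
A minimal sketch of one common fix, assuming the task's __init__ takes the argparse namespace as the trainer scripts suggest: pytorch-lightning forwards extra keyword arguments of load_from_checkpoint to the module's constructor, so the original args can be supplied at load time:

    # 'args' must be rebuilt with the same fields used during training
    # (assumption: the task is defined as __init__(self, args))
    model = BQTask.load_from_checkpoint(
        "F:/PycharmProjects/ChineseBert/output/bq/checkpoint/epoch=3-val_loss=0.5043-val_acc=0.8612.ckpt",
        args=args,
    )
    model.eval()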

A question about the example in the README

I'm new to PyTorch. With the README example "我喜欢猫", shouldn't repeated calls to chinese_bert.forward produce different outputs, since the model contains dropout? Why do I always get the same result?
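
A likely explanation, since this is standard transformers behaviour rather than anything ChineseBERT-specific: from_pretrained returns the model in eval mode, where dropout is disabled, so repeated forward passes are deterministic. Dropout only randomizes outputs in train mode:

    # names follow the README example; train/eval semantics are standard PyTorch
    chinese_bert = GlyceBertModel.from_pretrained(CHINESEBERT_PATH)  # eval mode by default
    chinese_bert.train()   # dropout active: repeated forward passes now differ
    chinese_bert.eval()    # dropout disabled: outputs are deterministic again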

Usability question

Hello, and congratulations on this work being accepted to ACL 2021. Pre-training that fuses glyph and pinyin information will certainly benefit Chinese NLP tasks.
I would also like to use ChineseBert on tasks beyond those covered in the paper. Is there a BERT-like API that can be called, such as:

    tokenizer = Tokenizer.from_pretrain([ChineseBert])
    config = Config.from_pretrain([ChineseBert])
    model = Bert.from_pretrain([ChineseBert])

Or is there an instruction describing how to invoke the model?
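
For reference, a minimal sketch modeled on the repo's README quick tour (class names and paths follow the README; the exact signatures are an assumption):

    from datasets.bert_dataset import BertDataset
    from models.modeling_glycebert import GlyceBertModel

    CHINESEBERT_PATH = "ChineseBERT-base"        # local checkpoint directory
    tokenizer = BertDataset(CHINESEBERT_PATH)    # yields input_ids and pinyin_ids
    chinese_bert = GlyceBertModel.from_pretrained(CHINESEBERT_PATH)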

Error loading the font images when calling the interface

Hello, and thank you for your work! When I call the interface following the quick tour example, I get: ValueError: cannot reshape array of size 3555312 into shape (23236,24,24). The problem comes from np.load(np_file).astype(np.float32) for np_file in font_npy_files. Could you help me find the cause? (Is something wrong with the downloaded font npy files? I re-downloaded them several times without success, and all dependency versions exactly follow requirement.txt.) Thanks again for your work and your help!
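
A plausible diagnosis, by arithmetic alone: a (23236, 24, 24) array needs 23236 × 24 × 24 = 13,383,936 elements, but the file on disk holds only 3,555,312, so the .npy is truncated or is not the real payload (for example, a Git LFS pointer or an error page saved under the .npy name); comparing the file size against the published release before loading would confirm this.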

Hello, I want to remove one embedding from fusion_embedding and hit the following problem. How can I solve it? Many thanks.

File "/root/ChineseBert/ChineseBert-main/models/fusion_embedding.py", line 72, in forward
  inputs_embeds = self.map_fc(concat_embeddings)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
  result = self.forward(*input, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
  return F.linear(input, self.weight, self.bias)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1692, in linear
  output = input.matmul(weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0
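
A plausible cause, inferred from the traceback rather than confirmed: map_fc is a Linear layer whose input width equals the concatenation of all the fusion embeddings, so removing one embedding shrinks concat_embeddings without shrinking the layer. A minimal standalone sketch of the mismatch and the fix (dimensions are illustrative):

    import torch
    import torch.nn as nn

    hidden = 768
    map_fc = nn.Linear(3 * hidden, hidden)      # fusion of three embeddings
    concat_two = torch.randn(1, 5, 2 * hidden)  # after removing one embedding
    # map_fc(concat_two) raises "mat1 dim 1 must match mat2 dim 0";
    # rebuilding the layer with the new concatenated width fixes it:
    map_fc = nn.Linear(2 * hidden, hidden)
    out = map_fc(concat_two)                    # shape (1, 5, 768)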

Error when launching the training code

Hi, my environment is installed exactly like yours, so why do I get the following error? Could it be the Python version? I'm using Python 3.8. Hoping for an answer.
[screenshot of the error]

A newcomer's question about paths

In bert_dataset, should
vocab_file = "/data/nfsdata2/sunzijun/glyce/glyce/bert_chinese_base_large_vocab/vocab.txt"
config_path = "/data/nfsdata2/sunzijun/glyce/glyce/config"
be changed to paths under CHINESEBERT_PATH?

Also in bert_dataset, the BertMaskDataset in tokenizer = BertMaskDataset(vocab_file, config_path) is flagged as unresolved.
I tried adding from bert_mask_dataset import BertMaskDataset at the top of the file,
but got ModuleNotFoundError: No module named 'bert_mask_dataset'.

Thanks for reading.

Pre-training details

During pre-training, how did you load the 100GB corpus into the dataset?
As far as I know, when the data is small it is simply read into memory in the dataset's __init__, but what approach do you use when the data is this large?
I see torch provides an IterableDataset, but with that dataset, is there no way to shuffle?
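
Not an answer from the authors, but the standard workaround for shuffling a torch IterableDataset is a bounded shuffle buffer: hold N examples, emit one at random, refill from the stream. A minimal sketch:

    import random
    from torch.utils.data import IterableDataset

    class ShuffleBuffer(IterableDataset):
        """Approximate shuffling of a streaming dataset via a bounded buffer."""

        def __init__(self, source, buffer_size=10000):
            self.source = source            # any iterable of examples
            self.buffer_size = buffer_size

        def __iter__(self):
            buffer = []
            for item in self.source:
                if len(buffer) < self.buffer_size:
                    buffer.append(item)
                    continue
                idx = random.randrange(self.buffer_size)
                buffer[idx], item = item, buffer[idx]
                yield item                  # yield the element swapped out
            random.shuffle(buffer)
            yield from buffer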

Version conflict with pytorch-lightning

Hello, I've run into a version conflict and can't resolve it (I am not very familiar with pytorch-lightning).
[screenshot]
This expression seems to have been deprecated in the newer version; the callback mechanism can't be found in the pytorch-lightning user guide, and the error page says the same.
[screenshot]
So I changed it as follows:
[screenshot]
That made the original problem go away, but this appeared instead:
[screenshot]
It looks like some count is being exceeded, yet the dataset I customized earlier was verified using the supplied method.
[screenshot]

Another possibility is the relative path in use, but that seems unlikely.

A question about the pinyin embedding

Hi, why is the pinyin embedding length 8? As far as I know the longest pinyin sequence is 6 letters, e.g. zhuang, and even with the tone it is only 7, so wouldn't one position always remain empty?
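
For readers with the same question: if memory serves, the paper fixes the pinyin sequence at length 8 and pads shorter sequences with a special symbol, so for zhuang plus a tone digit one slot does indeed remain padding; the fixed length presumably just leaves uniform headroom.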

OSError: Failed to interpret file

Hi,
I'm getting this error: OSError: Failed to interpret file 'chineseBert20210929/datasets/ChineseBERT-base/config/._STXINGKA.TTF24.npy' as a pickle. Do you know how to solve it? Thanks.

Cannot reproduce fine tuning on ChnSentiCorp

Validation sanity check: 0it [00:00, ?it/s]thread '<unnamed>' panicked at 'no entry found for key', D:\a\tokenizers\tokenizers\tokenizers\src\models\mod.rs:36:66
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "ChnSetiCorp_trainer.py", line 227, in <module>
    main()
  File "ChnSetiCorp_trainer.py", line 216, in main
    trainer.fit(model)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\trainer\states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1084, in fit
    results = self.accelerator_backend.train(model)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\accelerators\cpu_backend.py", line 39, in train
    results = self.trainer.run_pretrain_routine(model)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1224, in run_pretrain_routine
    self._run_sanity_check(ref_model, model)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1257, in _run_sanity_check
    eval_results = self._evaluate(model, self.val_dataloaders, max_batches, False)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\trainer\evaluation_loop.py", line 305, in _evaluate
    for batch_idx, batch in enumerate(dataloader):
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\torch\utils\data\dataloader.py", line 279, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\torch\utils\data\dataloader.py", line 719, in __init__
    w.start()
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
pyo3_runtime.PanicException: no entry found for key

Based on huggingface/tokenizers#260, this suggests that vocab.json is missing one entry, but when I try tokenizer.encode(sentence) on every line of ChnSentiCorp, it works.
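
A hedged observation rather than a confirmed fix: the panic surfaces inside ForkingPickler while the DataLoader is spawning worker processes, which suggests the Rust-backed tokenizer object fails to pickle under Windows' spawn start method. Keeping data loading in the main process often sidesteps it:

    from torch.utils.data import DataLoader

    # 'dataset' is whatever Dataset the trainer builds; num_workers=0 avoids the
    # Windows spawn-and-pickle path in which the tokenizer panics
    loader = DataLoader(dataset, batch_size=16, num_workers=0)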
