
chinesebert's People

Contributors

littlesulley, xiaoya-li, zijunsun


chinesebert's Issues

About CPU

When running the LCQMC example under tasks, how should things be adjusted if there is no GPU?

python LCQMC_trainer.py --bert_path ../ChineseBERT-base/ --data_dir E:/PycharmProjects/ChineseBert/lcqmc/ --save_path ../../cus_output/ --max_epoch=7 --lr=2e-5 --batch_size=16 --gpus=0

Can the last argument be omitted? Which arguments and code changes are needed?
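
Not an official answer, but for reference: in pytorch-lightning (the 1.x-era API these scripts appear to use), passing gpus=0 or gpus=None to the Trainer selects CPU-only training. A minimal sketch, assuming the script builds its Trainer from the parsed arguments:

    import pytorch_lightning as pl

    # gpus=None (or the integer 0) tells Lightning to train on CPU only;
    # any DDP/distributed backend setting should also be dropped for CPU runs
    trainer = pl.Trainer(gpus=None, max_epochs=7)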

Error in the textClassifier task

Hello, I have a question about the textClassifier task: there are no errors on the thucnews data, but many errors on my own data (all code is identical to yours). An answer I found online says the label_map should start from 0, e.g. {"体育": 0, "娱乐": 1, "家居": 2}, but the same error persists after following that advice. Can you help? Thank you!
[screenshot of the error]
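
For reference, a minimal sketch of building a contiguous, zero-based label map from the labels present in the data (variable names are illustrative, not from the repo):

    # derive label ids from the labels actually present in the training file
    labels = ["体育", "娱乐", "家居"]  # illustrative
    label_map = {label: i for i, label in enumerate(sorted(set(labels)))}
    # -> {"体育": 0, "娱乐": 1, "家居": 2} (sorted by Unicode code point)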

How to continue pre-training on my own data

Hello! Do you have the code used to pre-train the model? I tried further pre-training with run_mlm.py [https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling/run_mlm.py], but the tokenizer it calls differs from the one in your model (BertMaskDataset), and swapping it in caused many problems. I'd appreciate your help, thanks!

Training details

What hardware did you train on, and how long did training take?

NER

Why does the loss never go down when a CRF layer is added on top?

Hi, I've run into a GPU problem

Some weights of the model checkpoint at E:\CODE\pythonProject\ChineseBert-main\CHINESEBERT_PATH were not used when initializing GlyceBertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']

- This IS expected if you are initializing GlyceBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing GlyceBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GlyceBertForSequenceClassification were not initialized from the model checkpoint at E:\CODE\pythonProject\ChineseBert-main\CHINESEBERT_PATH and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
D:\Anaconda\envs\py37torch\lib\site-packages\pytorch_lightning\utilities\distributed.py:37: UserWarning: WORLD_SIZE environment variable (2) is not equal to the computed world size (1). Ignored.
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1

Could someone tell me what's going on here? I haven't found a solution anywhere.
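
One plausible reading of the log (not confirmed by the authors): the UserWarning means a stale WORLD_SIZE=2 environment variable is set in the shell while Lightning computed a world size of 1 for this run; the variable is explicitly ignored, and the final line shows single-process DDP initializing normally, so this is likely a harmless warning rather than an error.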

Cannot Run your scripts

Hello Authors

Which version of transformers are you using? I was not able to execute your code. Also, if there are any other dependencies, please let us know.

About the segment embedding

The paper says the segment embedding was dropped because pre-training does not use the NSP task, so why does token_type_embeddings still appear in fusion_embedding.py in the source, even though it is initialized with zeros?

    self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

    if token_type_ids is None:
        token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

    token_type_embeddings = self.token_type_embeddings(token_type_ids)
    embeddings = inputs_embeds + position_embeddings + token_type_embeddings

About the F1 score on the MSRA dataset

Hello. Many papers report an F1 of roughly 95-96 on the MSRA dataset, yet in your paper even plain BERT reaches 99. Did you pre-process the dataset, or does the dataset used in the paper differ from the public one?
If it was pre-processed, would it be possible to release that version of the dataset?

About obtaining the glyph vector of each character

I want to compute the similarity between the glyphs of two characters, which requires their glyph vectors.
[screenshot]
Which file on Hugging Face computes the glyphs, and can the glyph vectors be obtained directly from model/glyph_embedding.py?
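
A minimal sketch of one way to compare glyphs directly from the released font files (the file path and the (23236, 24, 24) layout are assumptions based on the .npy files shipped with ChineseBERT-base, not an official API):

    import numpy as np

    # assumed: one 24x24 bitmap per vocabulary entry (reshape in case it is stored flat)
    glyphs = np.load("ChineseBERT-base/config/STFANGSO.TTF24.npy").astype(np.float32)
    glyphs = glyphs.reshape(-1, 24, 24)

    def glyph_cosine(i: int, j: int) -> float:
        """Cosine similarity between the glyph bitmaps of token ids i and j."""
        a, b = glyphs[i].ravel(), glyphs[j].ravel()
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))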

About the tokenizer

I've been reading your paper and code, and the tokenizer puzzles me. A regular tokenizer produces only input_ids, while yours produces both input_ids and pinyin_ids, which seems quite magical. Could you open-source the code used to train the tokenizer?
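
Not the repo's actual code, but a minimal sketch of how pinyin_ids can be produced alongside input_ids using the pypinyin package (the letter-to-id table and the fixed length of 8 are assumptions modeled on the paper's description):

    from pypinyin import pinyin, Style

    PAD = 0
    # illustrative alphabet: letters a-z then tone digits, each mapped to a small id
    CHAR2ID = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz012345")}

    def to_pinyin_ids(text: str, length: int = 8) -> list:
        """One fixed-length id sequence per character, e.g. '猫' -> 'mao1' -> ids."""
        out = []
        for item in pinyin(text, style=Style.TONE3):  # tone appended as a digit
            syllable = item[0][:length]
            ids = [CHAR2ID.get(c, PAD) for c in syllable]
            out.append(ids + [PAD] * (length - len(ids)))
        return out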

A question

Hi, I remember the paper describes a whole-word-masking strategy. Why can't I find it in the code?
Thanks!

What should I do about this recurring error? I don't know how to fix it

(yl) D:\ChineseBert-main\tasks\THUCNew>python THUCNews_trainer.py --bert_path ./111/ --data_dir ./cnews/ --save_path ./222/ --max_epoch=5 --lr=2e-5 --batch_size=8 --gpus=0
Some weights of the model checkpoint at ./111/ were not used when initializing GlyceBertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']

- This IS expected if you are initializing GlyceBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing GlyceBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GlyceBertForSequenceClassification were not initialized from the model checkpoint at ./111/ and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "THUCNews_trainer.py", line 229, in <module>
    main()
  File "THUCNews_trainer.py", line 193, in main
    model = ChnSentiClassificationTask(args)
  File "THUCNews_trainer.py", line 51, in __init__
    self.model = GlyceBertForSequenceClassification.from_pretrained(self.bert_dir)
  File "d:\ProgramData\Anaconda3\envs\yl\lib\site-packages\transformers\modeling_utils.py", line 1071, in from_pretrained
    model.__class__.__name__, "\n\t".join(error_msgs)
RuntimeError: Error(s) in loading state_dict for GlyceBertForSequenceClassification:
size mismatch for bert.embeddings.glyph_embeddings.embedding.weight: copying a param with shape torch.Size([23236, 1728]) from checkpoint, the shape in current model is torch.Size([23236, 1152]).
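
One plausible reading of the size mismatch (arithmetic only, not an official diagnosis): the glyph embedding stores flattened 24×24 bitmaps per font, and 1728 = 24 × 24 × 3 while 1152 = 24 × 24 × 2, so the checkpoint appears to have been built with three font files while the config directory being loaded resolves only two; checking that all font .npy files are present under ./111/ would be the first thing to try.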

Error loading font files in ChnSetiCorp

When running ChnSetiCorp.py, I get the error: Failed to interpret file 'ChineseBERT-base\config\._STFANGSO.TTF24.npy' as a pickle
How can this be resolved? I couldn't find an answer online.
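
A likely cause, judging from the '._' prefix (an assumption, not confirmed by the authors): files named ._*.npy are macOS AppleDouble metadata created when the archive was packed on a Mac, not real NumPy files, so np.load cannot parse them. Deleting them, or skipping them when globbing, usually resolves this. A minimal sketch:

    import glob
    import os

    # keep only real .npy files, skipping macOS '._' metadata companions
    font_npy_files = [
        f
        for f in glob.glob(os.path.join("ChineseBERT-base", "config", "*.npy"))
        if not os.path.basename(f).startswith("._")
    ]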

The parser defines a dropout argument, but the rest of the code never uses it

    parser.add_argument("--hidden_dropout_prob", default=0.1, type=float, help="dropout probability")

hidden_dropout_prob is defined here, yet I cannot find anywhere in the code that actually uses this command-line argument. For example:

    self.bert = GlyceBertModel(config)
    self.dropout = nn.Dropout(config.hidden_dropout_prob)
    self.classifier = nn.Linear(config.hidden_size, config.num_labels)

Line 253 of modeling_glycebert.py uses hidden_dropout_prob directly (from the config, not from the parser).

Running the model

Can the model be run directly inside PyCharm?

How to run testing directly

I have trained LCQMC_trainer and saved a ckpt file. How do I use the ckpt file to test?
Loading the saved checkpoint with model = BQTask.load_from_checkpoint("F:/PycharmProjects/ChineseBert/output/bq/checkpoint/epoch=3-val_loss=0.5043-val_acc=0.8612.ckpt") fails because the constructor arguments cannot be found.
[screenshot]
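
A minimal sketch of one common fix, assuming the task's __init__ takes the argparse namespace as the trainer scripts suggest: pytorch-lightning forwards extra keyword arguments of load_from_checkpoint to the module's constructor, so the original args can be supplied at load time:

    # 'args' must be rebuilt with the same fields used during training
    # (assumption: the task is defined as __init__(self, args))
    model = BQTask.load_from_checkpoint(
        "F:/PycharmProjects/ChineseBert/output/bq/checkpoint/epoch=3-val_loss=0.5043-val_acc=0.8612.ckpt",
        args=args,
    )
    model.eval()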

A question about the example in the README

I'm new to PyTorch. With the README example "我喜欢猫", shouldn't repeated calls to chinese_bert.forward produce different outputs, since the model contains dropout? Why do I always get the same result?
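
A likely explanation, since this is standard transformers behaviour rather than anything ChineseBERT-specific: from_pretrained returns the model in eval mode, where dropout is disabled, so repeated forward passes are deterministic. Dropout only randomizes outputs in train mode:

    # names follow the README example; train/eval semantics are standard PyTorch
    chinese_bert = GlyceBertModel.from_pretrained(CHINESEBERT_PATH)  # eval mode by default
    chinese_bert.train()   # dropout active: repeated forward passes now differ
    chinese_bert.eval()    # dropout disabled: outputs are deterministic again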

Usability question

Hello, and congratulations on this work being accepted to ACL 2021. Pre-training that fuses glyph and pinyin information will certainly benefit Chinese NLP tasks.
I would also like to use ChineseBert on tasks beyond those covered in the paper. Is there a BERT-like API that can be called, such as:

    tokenizer = Tokenizer.from_pretrain([ChineseBert])
    config = Config.from_pretrain([ChineseBert])
    model = Bert.from_pretrain([ChineseBert])

Or is there an instruction describing how to invoke the model?
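
For reference, a minimal sketch modeled on the repo's README quick tour (class names and paths follow the README; the exact signatures are an assumption):

    from datasets.bert_dataset import BertDataset
    from models.modeling_glycebert import GlyceBertModel

    CHINESEBERT_PATH = "ChineseBERT-base"        # local checkpoint directory
    tokenizer = BertDataset(CHINESEBERT_PATH)    # yields input_ids and pinyin_ids
    chinese_bert = GlyceBertModel.from_pretrained(CHINESEBERT_PATH)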

Error loading the font images when calling the interface

Hello, and thank you for your work! When I call the interface following the quick tour example, I get: ValueError: cannot reshape array of size 3555312 into shape (23236,24,24). The problem comes from np.load(np_file).astype(np.float32) for np_file in font_npy_files. Could you help me find the cause? (Is something wrong with the downloaded font npy files? I re-downloaded them several times without success, and all dependency versions exactly follow requirement.txt.) Thanks again for your work and your help!
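
A plausible diagnosis, by arithmetic alone: a (23236, 24, 24) array needs 23236 × 24 × 24 = 13,383,936 elements, but the file on disk holds only 3,555,312, so the .npy is truncated or is not the real payload (for example, a Git LFS pointer or an error page saved under the .npy name); comparing the file size against the published release before loading would confirm this.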

Hello, I want to remove one embedding from fusion_embedding and hit the following problem. How can I solve it? Many thanks.

File "/root/ChineseBert/ChineseBert-main/models/fusion_embedding.py", line 72, in forward
  inputs_embeds = self.map_fc(concat_embeddings)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
  result = self.forward(*input, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
  return F.linear(input, self.weight, self.bias)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1692, in linear
  output = input.matmul(weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0
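
A plausible cause, inferred from the traceback rather than confirmed: map_fc is a Linear layer whose input width equals the concatenation of all the fusion embeddings, so removing one embedding shrinks concat_embeddings without shrinking the layer. A minimal standalone sketch of the mismatch and the fix (dimensions are illustrative):

    import torch
    import torch.nn as nn

    hidden = 768
    map_fc = nn.Linear(3 * hidden, hidden)      # fusion of three embeddings
    concat_two = torch.randn(1, 5, 2 * hidden)  # after removing one embedding
    # map_fc(concat_two) raises "mat1 dim 1 must match mat2 dim 0";
    # rebuilding the layer with the new concatenated width fixes it:
    map_fc = nn.Linear(2 * hidden, hidden)
    out = map_fc(concat_two)                    # shape (1, 5, 768)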

Error when launching the training code

Hi, my environment is installed exactly like yours, so why do I get the following error? Could it be the Python version? I'm using Python 3.8. Hoping for an answer.
[screenshot of the error]

A newcomer's question about paths

In bert_dataset, should
vocab_file = "/data/nfsdata2/sunzijun/glyce/glyce/bert_chinese_base_large_vocab/vocab.txt"
config_path = "/data/nfsdata2/sunzijun/glyce/glyce/config"
be changed to paths under CHINESEBERT_PATH?

Also in bert_dataset, the BertMaskDataset in tokenizer = BertMaskDataset(vocab_file, config_path) is flagged as unresolved.
I tried adding from bert_mask_dataset import BertMaskDataset at the top of the file,
but got ModuleNotFoundError: No module named 'bert_mask_dataset'.

Thanks for reading.

Pre-training details

During pre-training, how did you load the 100GB corpus into the dataset?
As far as I know, when the data is small it is simply read into memory in the dataset's __init__, but what approach do you use when the data is this large?
I see torch provides an IterableDataset, but with that dataset, is there no way to shuffle?
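
Not an answer from the authors, but the standard workaround for shuffling a torch IterableDataset is a bounded shuffle buffer: hold N examples, emit one at random, refill from the stream. A minimal sketch:

    import random
    from torch.utils.data import IterableDataset

    class ShuffleBuffer(IterableDataset):
        """Approximate shuffling of a streaming dataset via a bounded buffer."""

        def __init__(self, source, buffer_size=10000):
            self.source = source            # any iterable of examples
            self.buffer_size = buffer_size

        def __iter__(self):
            buffer = []
            for item in self.source:
                if len(buffer) < self.buffer_size:
                    buffer.append(item)
                    continue
                idx = random.randrange(self.buffer_size)
                buffer[idx], item = item, buffer[idx]
                yield item                  # yield the element swapped out
            random.shuffle(buffer)
            yield from buffer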

Version conflict with pytorch-lightning

Hello, I've run into a version conflict and can't resolve it (I am not very familiar with pytorch-lightning).
[screenshot]
This expression seems to have been deprecated in the newer version; the callback mechanism can't be found in the pytorch-lightning user guide, and the error page says the same.
[screenshot]
So I changed it as follows:
[screenshot]
That made the original problem go away, but this appeared instead:
[screenshot]
It looks like some count is being exceeded, yet the dataset I customized earlier was verified using the supplied method.
[screenshot]

Another possibility is the relative path in use, but that seems unlikely.

A question about the pinyin embedding

Hi, why is the pinyin embedding length 8? As far as I know the longest pinyin sequence is 6 letters, e.g. zhuang, and even with the tone it is only 7, so wouldn't one position always remain empty?
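
For readers with the same question: if memory serves, the paper fixes the pinyin sequence at length 8 and pads shorter sequences with a special symbol, so for zhuang plus a tone digit one slot does indeed remain padding; the fixed length presumably just leaves uniform headroom.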

OSError: Failed to interpret file

Hi,
I'm getting this error: OSError: Failed to interpret file 'chineseBert20210929/datasets/ChineseBERT-base/config/._STXINGKA.TTF24.npy' as a pickle. Do you know how to solve it? Thanks.

Cannot reproduce fine tuning on ChnSentiCorp

Validation sanity check: 0it [00:00, ?it/s]thread '<unnamed>' panicked at 'no entry found for key', D:\a\tokenizers\tokenizers\tokenizers\src\models\mod.rs:36:66
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "ChnSetiCorp_trainer.py", line 227, in <module>
    main()
  File "ChnSetiCorp_trainer.py", line 216, in main
    trainer.fit(model)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\trainer\states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1084, in fit
    results = self.accelerator_backend.train(model)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\accelerators\cpu_backend.py", line 39, in train
    results = self.trainer.run_pretrain_routine(model)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1224, in run_pretrain_routine
    self._run_sanity_check(ref_model, model)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1257, in _run_sanity_check
    eval_results = self._evaluate(model, self.val_dataloaders, max_batches, False)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\pytorch_lightning\trainer\evaluation_loop.py", line 305, in _evaluate
    for batch_idx, batch in enumerate(dataloader):
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\torch\utils\data\dataloader.py", line 279, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\site-packages\torch\utils\data\dataloader.py", line 719, in __init__
    w.start()
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\jingtanwang\Anaconda3\envs\chinesebert\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
pyo3_runtime.PanicException: no entry found for key

Based on huggingface/tokenizers#260, this suggests that vocab.json is missing one entry, but when I try tokenizer.encode(sentence) on every line of ChnSentiCorp, it works.
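
A hedged observation rather than a confirmed fix: the panic surfaces inside ForkingPickler while the DataLoader is spawning worker processes, which suggests the Rust-backed tokenizer object fails to pickle under Windows' spawn start method. Keeping data loading in the main process often sidesteps it:

    from torch.utils.data import DataLoader

    # 'dataset' is whatever Dataset the trainer builds; num_workers=0 avoids the
    # Windows spawn-and-pickle path in which the tokenizer panics
    loader = DataLoader(dataset, batch_size=16, num_workers=0)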
