cluener2020's Introduction

Chinese NER Project

This project is a code implementation of the baselines for the CLUENER2020 task. The models include:

  • BiLSTM-CRF
  • BERT-base + X (softmax/CRF/BiLSTM+CRF)
  • Roberta + X (softmax/CRF/BiLSTM+CRF)

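The BERT-base-X code in this project follows the approach of lemonhu.

For a description of the project, see the Zhihu article "Using BERT for NER? An Easy Introduction to Roberta with PyTorch!" (用BERT做NER?教你用PyTorch轻松入门Roberta!)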

Dataset

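The experimental data comes from CLUENER2020, a fine-grained Chinese named entity recognition dataset. It was built by selecting part of THUCNEWS, the open-source text classification dataset from Tsinghua University, and annotating it with fine-grained labels. The training, validation, and test sets contain 10,748, 1,343, and 1,345 examples respectively. The average sentence length is 37.4 characters and the maximum is 50 characters. Because the test set is not directly available and the leaderboard limits the number of submissions, this project uses the CLUENER2020 validation set as the test set for evaluating model performance.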

CLUENER2020 has 10 categories: organization, name, address, company, government, book, game, movie, position, and scene.

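The raw data is located under the /data/clue/ path of each model, in the files train.json and test.json. Each line of a file is a separate example, consisting of an original sentence and the labels on it, in the following form: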

{
	"text": "浙商银行企业信贷部叶老桂博士则从另一个角度对五道门槛进行了解读。叶老桂认为,对目前国内商业银行而言,",
	"label": {
		"name": {
			"叶老桂": [
				[9, 11],
				[32, 34]
			]
		},
		"company": {
			"浙商银行": [
				[0, 3]
			]
		}
	}
}
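A record in this format can be turned into one tag per character. The following is a minimal sketch, assuming the span indices are inclusive at both ends (as the sample suggests, where 浙商银行 occupies [0, 3]) and a plain BIO scheme. The function name spans_to_bio is hypothetical, and the project's own preprocessing may use a different tagging scheme.

```python
import json


def spans_to_bio(record):
    """Convert one CLUENER-style record into a list of per-character BIO tags."""
    text = record["text"]
    tags = ["O"] * len(text)
    for category, mentions in record.get("label", {}).items():
        for _mention, spans in mentions.items():
            for start, end in spans:  # assumed inclusive on both ends
                tags[start] = "B-" + category
                for i in range(start + 1, end + 1):
                    tags[i] = "I-" + category
    return tags


# A shortened version of the sample record above, as one JSON line.
line = ('{"text": "浙商银行企业信贷部叶老桂博士", "label": '
        '{"name": {"叶老桂": [[9, 11]]}, "company": {"浙商银行": [[0, 3]]}}}')
record = json.loads(line)
tags = spans_to_bio(record)
for char, tag in zip(record["text"], tags):
    print(char, tag)
```

Characters outside any span keep the tag O, so the tag list always has the same length as the sentence.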

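Because the annotation had to preserve the authenticity of the data, the dataset has some quality issues (see Data Issue 1 and Data Issue 2), but they have little effect on the whole.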

Model

Official CLUENER2020 leaderboard: link

This project implements the baseline models for the CLUENER2020 task. The corresponding paths are:

  • BiLSTM-CRF
  • BERT-Softmax
  • BERT-CRF
  • BERT-LSTM-CRF

Depending on the pretrained model used, the BERT-base-X models can be converted into Roberta-X models.

Requirements

This repo was tested on Python 3.6+ and PyTorch 1.5.1. The main requirements are:

  • tqdm
  • scikit-learn
  • pytorch >= 1.5.1
  • 🤗transformers == 2.2.2

To get the environment settled, run:

pip install -r requirements.txt

Pretrained Model Required

The pretrained BERT model must be downloaded in advance, including:

  • pytorch_model.bin
  • vocab.txt

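Place the files in the corresponding pretrained model folder under ./pretrained_bert_models, where:

bert-base-chinese model: download link

Note that the download link above provides only the TensorFlow version, which must be converted to the PyTorch version following the huggingface suggestion.

chinese_roberta_wwm_large model: download link

If that is too much trouble, the PyTorch versions of the models above can be obtained directly from the network drive link below 😊:

Link: https://pan.baidu.com/s/1rhleLywF_EuoxB2nmA212w Password: isc5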

Results

The results (F1 score) of each model on the dataset are shown in the table below. (Roberta always refers to the RoBERTa-wwm-ext-large model.)

Model         BiLSTM+CRF  Roberta+Softmax  Roberta+CRF  Roberta+BiLSTM+CRF
address       47.37       57.50            64.11        63.15
book          65.71       75.32            80.94        81.45
company       71.06       76.71            80.10        80.62
game          76.28       82.90            83.74        85.57
government    71.29       79.02            83.14        81.31
movie         67.53       83.23            83.11        85.61
name          71.49       88.12            87.44        88.22
organization  73.29       74.30            80.32        80.53
position      72.33       77.39            78.95        78.82
scene         51.16       62.56            71.36        72.86
overall       67.47       75.90            79.34        79.64

Parameter Setting

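1. Model parameters

./experiments/clue/config.json sets the basic parameters of the Bert/Roberta model. In the two pretrained model folders under ./pretrained_bert_models, config.json sets the basic Bert/Roberta parameters and also the parameters of the 'X' model (such as the LSTM). These can be changed as needed.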
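As an illustration of the extra 'X' parameters, the sketch below adds LSTM settings to a config dictionary before it would be written back to config.json. The key names lstm_embedding_size and lstm_dropout_prob and the values 1024 and 0.5 come from the model.py excerpt quoted in the issues further down. Whether the shipped config.json uses exactly these keys is an assumption, and the helper name add_lstm_params is hypothetical.

```python
import json


def add_lstm_params(config, embedding_size=1024, dropout_prob=0.5):
    """Return a copy of a BERT config dict with the assumed LSTM keys filled in.

    Keys that are already present are left unchanged.
    """
    updated = dict(config)
    updated.setdefault("lstm_embedding_size", embedding_size)
    updated.setdefault("lstm_dropout_prob", dropout_prob)
    return updated


# Stand-in for the contents of a pretrained model's config.json.
base_config = {"hidden_size": 1024, "hidden_dropout_prob": 0.1}
new_config = add_lstm_params(base_config)
print(json.dumps(new_config, indent=2))
```

Using setdefault keeps any value that is already in the file, so the helper only fills in what is missing.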

2. Other parameters

Environment paths and other hyperparameters are set in ./config.py.

Usage

Open the directory of the chosen model and enter on the command line:

python run.py

When the run finishes, the best model and the training log are saved under ./experiments/clue/. Bad cases on the test set are saved in ./case/bad_case.txt.

Attention

The train.log of the current model is already saved under ./experiments/clue/. To rerun the model, first move train.log out of that path so that it is not overwritten.

cluener2020's People

Contributors

corlder, hemingkx

cluener2020's Issues

Problem with requirements.txt

@ file:///tmp/build/80754af9/pillow_1603822238230/work
What are these entries in requirements.txt?

Runtime error related to BertConfig

I set up the Python and torch versions as described in the readme,
but the run fails at "loading weights file pretrained_bert_models/chinese_roberta_wwm_large_ext/pytorch_model.bin" with:
AttributeError: 'BertConfig' object has no attribute 'lstm_embedding_size'

ValueError: mask of the first timestep must all be on

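Hello, I only changed the labels in config.py, and I replaced the spaces in my data with ','. BERT-LSTM-CRF runs normally, but with the same data BERT-CRF raises this error. The error occurs in model.py at loss = self.crf(logits, labels, loss_mask) * (-1). What is the cause? Thank you for your help.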

No error is shown, but the model does not load and run normally

The system log is as follows:
2024-05-23 21:16:07,477:INFO: device: cuda:0
2024-05-23 21:16:07,478:INFO: --------Process Done!--------
2024-05-23 21:16:26,466:INFO: --------Dataset Build!--------
2024-05-23 21:16:26,467:INFO: --------Get Dataloader!--------
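Then the run ends. Because my CUDA version is 12.1, which is not compatible with older PyTorch versions, I am using PyTorch 2.3.
All other packages installed normally, but the project does not run. Any help would be appreciated, thanks!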

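Problem with BIO tagging

Hello, my dataset uses BIO tagging. After modifying the relevant code, the bad cases contain I-X tags that appear on their own. What went wrong here, and which part of the code needs to be changed?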

Didn't find file /pretrained_bert_models/bert-base-chinese/added_tokens.json. We won't load it. Didn't find file /pretrained_bert_models/bert-base-chinese/special_tokens_map.json. We won't load it. Didn't find file /pretrained_bert_models/bert-base-chinese/tokenizer_config.json. We won't load it.

Didn't find file /pretrained_bert_models/bert-base-chinese/added_tokens.json. We won't load it.
Didn't find file /pretrained_bert_models/bert-base-chinese/special_tokens_map.json. We won't load it.
Didn't find file /pretrained_bert_models/bert-base-chinese/tokenizer_config.json. We won't load it.

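How do I train on a GPU?

From what I have read online, GPU training requires setting up several things: the dataset, the loss function, and the model, including the model settings in the code. Can this code run in a GPU environment?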

Running data_process.py to get the .npz file

Hello, after I changed the data in test.json and ran data_process.py again, why was no corresponding npz file generated? Do I need to specify a file name?

Which package does nn in model.py come from?

In model.py:

from transformers.models.bert.modeling_bert import *
from torch.nn.utils.rnn import pad_sequence
from torchcrf import CRF


class BertNER(BertPreTrainedModel):
    def __init__(self, config):
        super(BertNER, self).__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.bilstm = nn.LSTM(
            input_size=config.lstm_embedding_size,  # 1024
            hidden_size=config.hidden_size // 2,  # 1024
            batch_first=True,
            num_layers=2,
            dropout=config.lstm_dropout_prob,  # 0.5
            bidirectional=True
        )
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.crf = CRF(config.num_labels, batch_first=True)

        self.init_weights()

For example, in self.dropout = nn.Dropout(config.hidden_dropout_prob) here, there is no import torch.nn as nn. Which package does this nn come from?

IndexError: index out of range in self

For some input data, the following problem occurs:
Traceback (most recent call last):
File "D:/Bert+Once/CLUENER2020/BERT-CRF/CheckBug.py", line 14, in
embed = embedding(input_to_embed)
File "D:\Bert+Once\venv\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "D:\Bert+Once\venv\lib\site-packages\torch\nn\modules\sparse.py", line 145, in forward
return F.embedding(
File "D:\Bert+Once\venv\lib\site-packages\torch\nn\functional.py", line 1913, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
What is the cause?
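An embedding lookup raises this error when an input index falls outside the range of rows in the embedding table, so one thing worth checking is whether any token id is too large for the model's vocabulary. The following is a minimal pre-check in plain Python, not a confirmed diagnosis of this report. The helper find_out_of_range_ids is hypothetical, and vocab_size stands for the number of rows in the embedding table; the value used here is illustrative only.

```python
def find_out_of_range_ids(token_ids, vocab_size):
    """Return (position, id) pairs whose id cannot index a table with vocab_size rows."""
    return [(pos, tid) for pos, tid in enumerate(token_ids)
            if tid < 0 or tid >= vocab_size]


vocab_size = 21128  # illustrative value; read the real one from the model config
token_ids = [101, 3851, 21128, 102]
bad = find_out_of_range_ids(token_ids, vocab_size)
print(bad)
```

Valid ids run from 0 to vocab_size - 1, so an id equal to vocab_size is already out of range.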

ValueError: cannot copy sequence with size 37 to array axis with dimension 36

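Hello, after I switched to BIEOS labels, my test data has no labels, so I gave every character a temporary label of O.
Then I ran the model and got the following error. Please advise!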

File "/NER/CLUENER2020/BERT-LSTM-CRF/train.py", line 83, in evaluate
    for idx, batch_samples in enumerate(dev_loader):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 560, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "NER/CLUENER2020/BERT-LSTM-CRF/data_loader.py", line 97, in collate_fn
    batch_labels[j][:cur_tags_len] = labels[j]

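The data_process module fails to run

At runtime, the np.savez_compressed API raises ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10748,) + inhomogeneous part.
NumPy arrays do not allow their elements to be variable-length lists. I tried several times and could never convert the json into an .npz file; this error is raised every time.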
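One possible workaround, offered as a sketch and not as the project's actual fix: recent NumPy versions refuse to build an array from lists of different lengths unless the array has dtype=object. The helper name to_object_array is hypothetical, the sample data is made up for illustration, and an in-memory buffer stands in for the .npz file path.

```python
import io

import numpy as np


def to_object_array(sequences):
    """Pack variable-length lists into a 1-D object array, one list per slot."""
    arr = np.empty(len(sequences), dtype=object)
    for i, seq in enumerate(sequences):
        arr[i] = seq
    return arr


words = [["浙", "商", "银", "行"], ["叶", "老", "桂"]]
labels = [["B-company", "I-company", "I-company", "I-company"],
          ["B-name", "I-name", "I-name"]]

buffer = io.BytesIO()  # stand-in for a real .npz path
np.savez_compressed(buffer,
                    words=to_object_array(words),
                    labels=to_object_array(labels))

buffer.seek(0)
# Object arrays are stored with pickle, so loading needs allow_pickle=True.
loaded = np.load(buffer, allow_pickle=True)
loaded_words = loaded["words"]
loaded_labels = loaded["labels"]
print(len(loaded_words), loaded_words[1])
```

Filling the array slot by slot keeps it one-dimensional even when all sequences happen to have the same length.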

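Switching to an English dataset

When English sentences are tokenized with bert-base-uncased, the number of tokens naturally differs from the number of labels. How should this be handled?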
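A common way to handle this, sketched here under assumptions and not taken from the project: keep each word's label on its first subword and mark the remaining subwords with a placeholder so they can be skipped in the loss. The toy_tokenize function is a hypothetical stand-in for a real WordPiece tokenizer, and the IGNORE tag is an illustrative choice.

```python
IGNORE = "X"  # placeholder tag for non-initial subwords (assumption)


def toy_tokenize(word):
    """Hypothetical stand-in for WordPiece: lowercase, then split into 4-character pieces."""
    word = word.lower()
    pieces = [word[i:i + 4] for i in range(0, len(word), 4)]
    return [p if i == 0 else "##" + p for i, p in enumerate(pieces)]


def align_labels(words, labels, tokenize=toy_tokenize):
    """Expand word-level labels so there is exactly one label per subword token."""
    tokens, aligned = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        if not pieces:  # skip words the tokenizer drops entirely
            continue
        tokens.extend(pieces)
        aligned.extend([label] + [IGNORE] * (len(pieces) - 1))
    return tokens, aligned


words = ["Zheshang", "Bank", "hired", "him"]
labels = ["B-company", "I-company", "O", "O"]
tokens, aligned = align_labels(words, labels)
for token, tag in zip(tokens, aligned):
    print(token, tag)
```

With a real tokenizer, the same loop applies as long as tokenize returns the list of subwords for a single word.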
