sequence_tagging's Introduction

sequence tagging project

-- Update 2020.08.16: --

Added an implementation of the paper Enhancing Pre-trained Chinese Character Representation with Word-aligned Attention, based on pytorch 1.6.0 and huggingface transformers 2.11.0. Word segmentation uses Baidu's lac 2.0, thulac, and ltp; please install these tools yourself.
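
Word-aligned attention depends on knowing which characters belong to which segmented word. As a minimal sketch, assuming the segmenter returns a list of non-empty words whose concatenation equals the input sentence, the following maps words to character spans and averages a per-character score inside each word. The names `word_spans` and `pool_within_words` are hypothetical and are not taken from this repository or from MWA; the plain mean is a simplification for illustration, not the paper's pooling scheme.

```python
from typing import List, Tuple


def word_spans(words: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for each word, end exclusive."""
    spans = []
    offset = 0
    for word in words:
        spans.append((offset, offset + len(word)))
        offset += len(word)
    return spans


def pool_within_words(scores: List[float], words: List[str]) -> List[float]:
    """Replace each character score with the mean score of its word."""
    if len(scores) != sum(len(w) for w in words):
        raise ValueError("scores must have one entry per character")
    pooled = []
    for start, end in word_spans(words):
        if end == start:  # skip empty words defensively
            continue
        mean = sum(scores[start:end]) / (end - start)
        pooled.extend([mean] * (end - start))
    return pooled


words = ["厦门", "与", "金门"]  # hypothetical segmenter output
print(word_spans(words))
print(pool_within_words([1.0, 3.0, 5.0, 2.0, 4.0], words))
```

With several segmenters (lac, thulac, ltp), the same span computation would be run once per segmentation result.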

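In addition, I have uploaded the already-segmented results; they are in set_orig_data_*.json under data. If you want to run the segmentation yourself, refer to the code in word_seg_utils.py.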

You can use test_data_processing.py to generate the data files.

Script for pytorch training:

bash run_torch.sh

reference

The pytorch ner framework implementation is based on https://github.com/lonePatient/BERT-NER-Pytorch , and the mwa implementation is based on https://github.com/lsvih/MWA


-- Update 2020.02.05: --

Added the missing code and scripts, plus a small amount of sample data, so that it is easy to test whether the code runs end to end.


Dependencies: mainly tensorflow 1.12.0; see requirements.txt for the rest.

The project currently includes the traditional Bilstm-crf model and models that use bert.

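Target data: currently entity recognition data annotated at the character level. It uses publicly available character-level Chinese word vectors.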

TODO:

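1. Combine word-level word vectors with character vectors for character-level tagging. 100% done.

2. Joint learning with intent classification.

3. Without bert, optimize on top of lstm-crf by adding a cnn architecture or an attention mechanism. 100% done.

4. Recast seq labeling as a reading comprehension problem, following the recent paper A Unified MRC Framework for Named Entity Recognition. 100% done.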
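
The reading comprehension formulation in item 4 can be illustrated with a small sketch: each entity type gets a natural-language query, and the character sequence is labelled with start and end pointers for entities of that type only. The query wording and the function name `to_mrc_examples` are assumptions made for illustration; they are not taken from this repository.

```python
from typing import Dict, List

# Hypothetical query; the real query wording is an assumption.
QUERIES: Dict[str, str] = {"LOC": "Find the locations mentioned in the text."}


def to_mrc_examples(chars: List[str], tags: List[str],
                    queries: Dict[str, str]) -> List[dict]:
    """Build one (query, context, start, end) example per entity type."""
    examples = []
    for ent_type, query in queries.items():
        start = [0] * len(chars)
        end = [0] * len(chars)
        i = 0
        while i < len(tags):
            if tags[i] == "B-" + ent_type:
                j = i
                # Extend the span over the following I- tags of the same type.
                while j + 1 < len(tags) and tags[j + 1] == "I-" + ent_type:
                    j += 1
                start[i] = 1
                end[j] = 1  # end pointer marks the last character (inclusive)
                i = j + 1
            else:
                i += 1
        examples.append({"query": query, "context": chars,
                         "start": start, "end": end})
    return examples


chars = list("海钓比赛地点在厦门与金门之间的海域。")
tags = ["O"] * 7 + ["B-LOC", "I-LOC", "O", "B-LOC", "I-LOC"] + ["O"] * 6
example = to_mrc_examples(chars, tags, QUERIES)[0]
print([i for i, s in enumerate(example["start"]) if s])
print([i for i, e in enumerate(example["end"]) if e])
```

A sentence with several entity types yields one example per type, so types with no entities in the sentence produce all-zero start and end vectors.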

The word vectors come from:

https://github.com/Embedding/Chinese-Word-Vectors

The word vector model is stored under data/embedding_data.

The bert pre-trained model used is:

chinese_roberta_wwm_ext_L-12_H-768_A-12

The bert pre-trained model is stored under the data/ root directory.

The training data cannot be uploaded for now, but its format is as follows:

海 钓 比 赛 地 点 在 厦 门 与 金 门 之 间 的 海 域 。

O O O O O O O B-LOC I-LOC O B-LOC I-LOC O O O O O O

The above is one sample. The project's data_preprocessing applies method-specific preprocessing and stores the processed data in .npy format.
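
To make the format concrete, here is a minimal sketch that reads one sample (a line of space-separated characters followed by a line of space-separated BIO tags) and recovers the entities. It does not reproduce data_preprocessing; `parse_sample` and `extract_entities` are hypothetical helper names.

```python
from typing import List, Tuple


def parse_sample(char_line: str, tag_line: str) -> Tuple[List[str], List[str]]:
    """Split the two lines of a sample and check that they align."""
    chars = char_line.split()
    tags = tag_line.split()
    if len(chars) != len(tags):
        raise ValueError("characters and tags must have the same length")
    return chars, tags


def extract_entities(chars: List[str],
                     tags: List[str]) -> List[Tuple[str, str, int, int]]:
    """Return (text, type, start, end) tuples, end exclusive."""
    entities = []
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last entity
        if tag.startswith("I-") and start is not None and tag[2:] == ent_type:
            continue  # still inside the current entity
        if start is not None:
            entities.append(("".join(chars[start:i]), ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return entities


chars, tags = parse_sample(
    "海 钓 比 赛 地 点 在 厦 门 与 金 门 之 间 的 海 域 。",
    "O O O O O O O B-LOC I-LOC O B-LOC I-LOC O O O O O O",
)
print(extract_entities(chars, tags))
```

An I- tag that does not follow a matching B- tag is ignored in this sketch; a stricter loader might raise an error instead.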

At present, bert+mrc training and evaluation work without problems; the other methods still need to be optimized and completed.

Training script:

bash run_train.sh

Evaluation script:

bash run_pred.sh

Summary of some experimental results:

method                    f1-micro-avg
bilstm+crf (baseline)     0.8702
bilstm+crf+wordemb        0.8783
bilstm+cnn+crf+wordemb    0.8818
bert+celoss               0.9333
bert+bilstm+crf           0.9387
bert+diceloss             0.9354
bert+mrc+celoss           0.9550
bert+mrc+focalloss        0.9580
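
For reference, a micro-averaged F1 pools true positives, predicted entities, and gold entities over all sentences and entity types before computing precision and recall. The sketch below assumes entities are compared as exact (type, start, end) matches; whether the numbers above were scored this way is not stated here, and `micro_f1` is a hypothetical helper.

```python
from typing import List, Set, Tuple

Entity = Tuple[str, int, int]  # (type, start, end)


def micro_f1(gold: List[Set[Entity]], pred: List[Set[Entity]]) -> float:
    """Micro-averaged F1 over per-sentence sets of entities."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # exact matches
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


gold = [{("LOC", 7, 9), ("LOC", 10, 12)}]
pred = [{("LOC", 7, 9), ("LOC", 13, 15)}]
print(micro_f1(gold, pred))
```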

sequence_tagging's Issues

args.do_test

Did the author perhaps leave out the if args.do_test: part of train_helper.py? Running load_and_predict.py reports No such file or directory: 'prediction_result_completelabel_bert_nocrf_dlloss.npy'; presumably the prediction step that generates the .npy file was never run.

Problem training on my own data

Hello, I trained on new data and found that the train and dev losses are very small, on the order of 1e-14. That does not seem normal; where did it go wrong?

Code is not complete

Two problems:

  1. I didn't find any functions where you define dice_dsc_loss or focal_dsc_loss
  2. Your run.py has been changed to optimization.py, please check. I can find it in your last version.

lstmcrf_prepare_data is missing from data_processing

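Data processing fails when running run_torch. First, data_processing has no lstmcrf_prepare_data file. Second, the word segment json files produced from the training corpus by the different segmenters are structurally broken: they are incomplete, and the last, incomplete record in each json file has to be deleted.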

Some questions about bert-mrc

https://github.com/qiufengyuyi/sequence_tagging/blob/master/models/bert_mrc.py
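Looking at the code, when f1 is computed for bert_mrc, shouldn't it be computed per entity?
From the code, it seems to be computed separately on the start and end pointers?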

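I have also been reproducing the bert-mrc model recently, on the MSRA dataset, but F1 only reaches 0.64 after 10 epochs. 50 epochs does somewhat better, but still does not get above 0.9, and I do not know where the problem is.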

Also, when extracting entities from the start and end pointers, some start pointers cannot be matched with an end pointer.
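
One common heuristic for this pairing step, offered as a sketch rather than as what bert_mrc.py does: pair each predicted start with the nearest predicted end at or after it that comes before the next start, and drop starts that have no such end. The function name `match_spans` is hypothetical.

```python
from typing import List, Tuple


def match_spans(start: List[int], end: List[int]) -> List[Tuple[int, int]]:
    """Pair each start index with the nearest end index at or after it."""
    starts = [i for i, s in enumerate(start) if s == 1]
    ends = [i for i, e in enumerate(end) if e == 1]
    spans = []
    for k, s in enumerate(starts):
        # An end is only eligible if it precedes the next predicted start.
        limit = starts[k + 1] if k + 1 < len(starts) else len(start)
        candidates = [e for e in ends if s <= e < limit]
        if candidates:
            spans.append((s, candidates[0]))  # end index is inclusive
        # Starts without an eligible end are dropped as unmatched.
    return spans


start = [0, 1, 0, 0, 1, 0, 1, 0]
end = [0, 0, 1, 0, 0, 0, 0, 1]
print(match_spans(start, end))
```

In this example the start at index 4 has no end before the next start at index 6, so it is discarded.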

optimization

in train_helper.py
from optimization import BertAdam
I can't find BertAdam in optimization
only class AdamWeightDecayOptimizer in optimization.py
