bert-chinese-classification-task's Introduction

bert-Chinese-classification-task

BERT Chinese text classification in practice

Add NewsProcessor to run_classifier_word.py. It handles preprocessing and loading of the news data.
In the main method, register the news task so that its labels are handled:
processors = {
    "cola": ColaProcessor,
    "mnli": MnliProcessor,
    "mrpc": MrpcProcessor,
    "news": NewsProcessor,
}
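
The repository's own NewsProcessor is not reproduced on this page, so the following is only a minimal sketch of what such a processor might look like. It assumes a tab-separated layout of label, then text, in files named train.tsv and dev.tsv; the file names, the column order, the InputExample fields, and the encoding parameter are all assumptions made for illustration.

```python
import csv
import os


class InputExample:
    """One labelled text for a single-sentence classification task."""

    def __init__(self, guid, text_a, label):
        self.guid = guid
        self.text_a = text_a
        self.label = label


class NewsProcessor:
    """Hypothetical processor for news data stored as label<TAB>text."""

    def __init__(self, labels, encoding="gbk"):
        # The label list and the file encoding are supplied by the caller;
        # "gbk" is only a default chosen for this sketch.
        self.labels = list(labels)
        self.encoding = encoding

    def get_train_examples(self, data_dir):
        # "train.tsv" is an assumed file name.
        return self._read(os.path.join(data_dir, "train.tsv"), "train")

    def get_dev_examples(self, data_dir):
        # "dev.tsv" is an assumed file name.
        return self._read(os.path.join(data_dir, "dev.tsv"), "dev")

    def get_labels(self):
        return self.labels

    def _read(self, path, set_type):
        examples = []
        with open(path, "r", encoding=self.encoding, newline="") as handle:
            reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            for index, row in enumerate(reader):
                if len(row) < 2:
                    continue  # skip blank or malformed lines
                guid = "%s-%d" % (set_type, index)
                examples.append(InputExample(guid, text_a=row[1], label=row[0]))
        return examples
```

Passing the encoding explicitly keeps the sketch usable for both GBK and UTF-8 data files.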

download_glue_data.py downloads the other public GLUE benchmark datasets used in the BERT paper into glue_data.

The data directory contains a sample of the news data.

export GLUE_DIR=/search/odin/bert/extract_code/glue_data
export BERT_BASE_DIR=/search/odin/bert/chinese_L-12_H-768_A-12/
export BERT_PYTORCH_DIR=/search/odin/bert/chinese_L-12_H-768_A-12/

python run_classifier_word.py \
  --task_name NEWS \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/NewsAll/ \
  --vocab_file $BERT_BASE_DIR/vocab.txt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --init_checkpoint $BERT_PYTORCH_DIR/pytorch_model.bin \
  --max_seq_length 256 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir ./newsAll_output/ \
  --local_rank 3

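Chinese classification task in practice

The experiment covers 34 Chinese news topics (including politics, entertainment, and sports). The preprocessing stage of run_classifier.py needs a NewsProcessor module, similar to MrpcProcessor, with the file encoding adjusted for Chinese text. The data, about 800,000 examples, was split 4:1 into training and test sets. Training on a single GPU took 18 hours and reached an accuracy of 92.8%.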

eval_accuracy = 0.9281581998809113

eval_loss = 0.2222444740207354

global_step = 59826

loss = 0.14488934577978746
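
As a rough consistency check, the reported global_step can be related to the batch size, epoch count, and data size stated above. The sketch below assumes one optimizer step per batch (no gradient accumulation) and that a final partial batch still counts as a step; neither assumption is confirmed by the source, and the function names are made up for illustration.

```python
def steps_per_epoch(global_step, num_train_epochs):
    """Optimizer steps in one epoch, assuming every epoch has the same count."""
    steps, remainder = divmod(global_step, num_train_epochs)
    if remainder:
        raise ValueError("global_step is not a whole multiple of the epoch count")
    return steps


def training_set_bounds(steps, batch_size):
    """Smallest and largest training-set sizes that give `steps` batches."""
    return (steps - 1) * batch_size + 1, steps * batch_size


# Values taken from the command line and the evaluation output above.
steps = steps_per_epoch(global_step=59826, num_train_epochs=3)
low, high = training_set_bounds(steps, batch_size=32)

# With a 4:1 train/test split, the training set is 4/5 of all the data.
implied_total = high * 5 // 4

print(steps, low, high, implied_total)
```

Under these assumptions the training set holds roughly 638,000 examples and the full data set just under 800,000, which agrees with the stated figure of about 800,000 examples split 4:1.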

bert-chinese-classification-task's People

Contributors

nlpscott

bert-chinese-classification-task's Issues

RuntimeError: Error(s) in loading state_dict for BertModel:

Hello, and thank you for sharing the code. I am fairly new to this. When execution reaches this step:
model.bert.load_state_dict(torch.load(args.init_checkpoint, map_location='cpu'))
I get the following error:
RuntimeError: Error(s) in loading state_dict for BertModel: Missing key(s) in state_dict: "embeddings.word_embeddings.weight. ...."
Why does this happen?

No module

ModuleNotFoundError: No module named 'optimization'

Could you also add optimization and the PyTorch checkpoint?

Could you also add optimization and the PyTorch checkpoint to the repository? A checkpoint I converted with the latest bert-pytorch master code fails with:
model.bert.load_state_dict(torch.load(args.init_checkpoint, map_location='cpu'))
RuntimeError: Error(s) in loading state_dict for BertModel:
Missing key(s) in state_dict:

loading state_dict for BertModel

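Hello, and thank you very much for the code.
While debugging, I downloaded Google's chinese_base archive, unpacked it, and converted it to a .bin file using the method at https://github.com/huggingface/pytorch-pretrained-BERT/tree/1de35b624b9d7998feb4d518e4f7e8e53abac4e1. I also tried the Chinese version provided at https://github.com/NLPScott/bert-Chinese-classification-task/issues/13. Both fail when the model is loaded:
RuntimeError: Error(s) in loading state_dict for BertModel:
Missing key(s) in state_dict: "embeddings.word_embeddings.weight",
The parameter names evidently do not match, probably because they were renamed at some point. I cannot resolve this myself. Could you take a look?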
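
One way to investigate a name mismatch like this is to compare the key names the model expects with those stored in the checkpoint. The sketch below works on plain dictionaries so it can be tried without PyTorch. The "bert." prefix is only a guess at one possible cause (a checkpoint saved from a wrapper model); the thread does not confirm it, and diff_keys and strip_prefix are hypothetical helper names.

```python
def diff_keys(expected_keys, checkpoint_keys):
    """Return (missing, unexpected) key names, each as a sorted list."""
    expected, found = set(expected_keys), set(checkpoint_keys)
    return sorted(expected - found), sorted(found - expected)


def strip_prefix(state_dict, prefix="bert."):
    """Keep only entries under `prefix` and drop the prefix from their names.

    If no key carries the prefix, an unchanged copy is returned.
    """
    if not any(key.startswith(prefix) for key in state_dict):
        return dict(state_dict)
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Possible use with PyTorch (not executed here):
#   state = torch.load(args.init_checkpoint, map_location="cpu")
#   print(diff_keys(model.bert.state_dict().keys(), state.keys()))
#   model.bert.load_state_dict(strip_prefix(state))
```

Printing the two lists first shows whether a prefix is really the problem before any keys are renamed.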

Max_sequence_length

Hi,
Does max_sequence_length = 256 mean 256 words for Chinese text? In other words, can each example hold up to roughly 500 to 600 characters?

Thanks
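
For context on this question: BERT's tokenizer splits Chinese text into individual characters, so for Chinese input the limit counts characters rather than multi-character words, and the [CLS] and [SEP] markers also take up two positions. The sketch below is a simplified imitation written for illustration, not the real tokenizer: it ignores punctuation splitting and WordPiece, and rough_tokens and build_sequence are hypothetical names.

```python
def is_cjk(char):
    """True for characters in the main CJK ideograph blocks (simplified)."""
    code = ord(char)
    return (
        0x4E00 <= code <= 0x9FFF
        or 0x3400 <= code <= 0x4DBF
        or 0xF900 <= code <= 0xFAFF
    )


def rough_tokens(text):
    """Split text so that every CJK character becomes its own token."""
    tokens, buffer = [], []
    for char in text:
        if is_cjk(char) or char.isspace():
            if buffer:  # flush a pending run of non-CJK characters
                tokens.append("".join(buffer))
                buffer = []
            if not char.isspace():
                tokens.append(char)
        else:
            buffer.append(char)
    if buffer:
        tokens.append("".join(buffer))
    return tokens


def build_sequence(text, max_seq_length=256):
    """Truncate to leave room for [CLS] and [SEP], then add both markers."""
    tokens = rough_tokens(text)[: max_seq_length - 2]
    return ["[CLS]"] + tokens + ["[SEP]"]
```

Under this simplification, a single-sentence example with max_seq_length 256 keeps at most 254 Chinese characters.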

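Pre-trained model

How was the BERT pre-trained model used here obtained?
Or was BERT applied directly to the classification task, without pre-training on Masked LM and next sentence prediction?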

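Does the Chinese encoding need to be adjusted?

Hello, I see that you read the training data as gbk when doing classification. Is setting the encoding required? My training data is UTF-8, and I read it with the default settings without specifying an encoding. The program runs without problems, but the training results are poor. Is this related to the encoding? Thanks.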
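
One way to rule encoding in or out is to decode the raw bytes explicitly instead of relying on the platform default, which in Python depends on the locale. The sketch below tries UTF-8 first and falls back to GBK. The helper name decode_text and the candidate order are assumptions made for illustration. UTF-8 goes first because GBK-encoded Chinese text usually fails strict UTF-8 decoding, while the reverse direction may decode without an error into the wrong characters.

```python
def decode_text(raw, encodings=("utf-8", "gbk")):
    """Decode bytes with the first candidate encoding that succeeds.

    Returns the decoded text together with the encoding that worked.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def read_text_file(path, encodings=("utf-8", "gbk")):
    """Read a whole file as bytes and decode it with decode_text."""
    with open(path, "rb") as handle:
        return decode_text(handle.read(), encodings)
```

Printing a few decoded lines and checking that they are readable Chinese is a quick test of whether the data reached the model intact.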
