
XLNet for Chinese, TensorFlow & PyTorch

Pre-trained XLNet models for Chinese

XLNet is a pre-training model proposed by CMU and Google Brain in June 2019 that outperforms BERT on multiple tasks. It keeps the form of an autoregressive language model while incorporating the strengths of autoencoding language models, via a permutation language modeling objective. It is also built on Transformer-XL, which gives it a better ability to handle long text.

Following the work of [2], this project trained a 24-layer Chinese XLNet_zh_Large model on a massive corpus, with over 300 million parameters.

Training Corpus & Training Details

The training data includes news, forum discussions, and encyclopedia text: over 30 GB of raw text, close to 10 billion Chinese characters. This project uses the same training data as the RoBERTa_zh project (Chinese pre-trained RoBERTa).

Trained for 2 days on a Google TPU v3-256 pod (32 v3-8 machines, each with 128 GB of device memory), for 200,000 steps with sequence_length 512 and batch_size 512.
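As a rough sanity check (our own estimate, not a figure from the project), the numbers above imply how many token positions the model processed during pre-training:

```python
# Back-of-the-envelope estimate of the pre-training scale described above.
# All inputs come from the text: 200,000 steps, batch_size 512, seq_len 512.
train_steps = 200_000
batch_size = 512
seq_len = 512

tokens_per_step = batch_size * seq_len        # 262,144 token positions per step
total_tokens = train_steps * tokens_per_step  # total positions seen in training

print(f"{total_tokens:,}")  # 52,428,800,000 -> roughly 52 billion
```

That is several passes over the roughly 10-billion-character corpus; the exact ratio depends on how SentencePiece tokenization changes the token count.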

Notices

XLNet_zh_Large has not been fully tested yet. It may perform very well on your task, or poorly on some tasks. We expect both good news and bad news; so far, the news on sentence-pair tasks (LCQMC) is bad.

Share Your Results (Performance)

If you use this project's Chinese pre-trained model, please tell us how it performed on your task: you can submit a pull request adding your comparison to README.md, or post it in an issue;

you can also join the Chinese pre-trained model transformers discussion group (QQ: 836811304) and share your results with us.

Download Pre-trained XLNet for Chinese Tasks

XLNet_zh_Large, TensorFlow version, on Baidu Netdisk or Google Drive

The Adam optimizer parameters have not been removed yet; after removing them, the model will shrink to about 1.3 GB.

XLNet_zh_Large_L-24_H-1024_A-16.zip 
  |- xlnet_model.ckpt    # model weights
  |- xlnet_model.index   # model index
  |- xlnet_model.meta    # model meta information
  |- xlnet_config.json   # configuration file
  |- spiece.model        # vocabulary (SentencePiece model)

For the PyTorch version, convert the checkpoint with a command like the one below; for details, see the pytorch_transformers project:

python -u -m pytorch_transformers.convert_tf_checkpoint_to_pytorch --tf_checkpoint_path XLNet-zh-Large-PyTorch/ --bert_config_file XLNet-zh-Large-PyTorch/config.json --pytorch_dump_path XLNet-zh-Large-PyTorch/xlnet_zh_large_pytorch_model.bin

How can we keep predicting from left to right (like a traditional language model) while still using information from the following context?

1.input_list:   [1, 2, 3, 4, 5, 6]
2.sampled_list: [2, 4, 6, 5, 3, 1]
3.array_2d:
                [[0. 1. 1. 1. 1. 1.]
                 [0. 0. 0. 0. 0. 0.]
                 [0. 1. 0. 1. 1. 1.]
                 [0. 1. 0. 0. 0. 0.]
                 [0. 1. 0. 1. 0. 1.]
                 [0. 1. 0. 1. 0. 0.]]

import random

import numpy as np

def xlnet_mask(input_list):
    """
    Take a list (e.g. [x1,x2,x3,x4]), sample a new order (e.g. [x3,x2,x4,x1]),
    and return an attention-mask matrix.
    The goal is that each token xi can only see the tokens that come before it
    in the sampled order. For the order [x3,x2,x4,x1]:
        x2 can see x3;
        x4 can see x3, x2;
        x1 can see x3, x2, x4;
        x3 can see nothing.
    In the matrix, 1 means "can see" and 0 means "cannot see".
    :param input_list:
    :return: matrix
    e.g.
    [[0,1,1,1],  # x1
     [0,0,1,0],  # x2
     [0,0,0,0],  # x3
     [0,1,1,0]]  # x4

    """
    print("1.input_list:", input_list)
    sampled_list = input_list[:]  # copy so the caller's list is not mutated
    random.shuffle(sampled_list)  # shuffle into a random order
    print("2.sampled_list:", sampled_list)
    num_size = len(sampled_list)

    array_2d = np.zeros((num_size, num_size))
    for index, current_element in enumerate(sampled_list):
        # the tokens before the current one in the sampled order
        previous_element_list = sampled_list[0:index]
        for previous_element in previous_element_list:
            array_2d[current_element - 1][previous_element - 1] = 1

    print("3.array_2d:\n", array_2d)
    return array_2d

input_list=[1,2,3,4,5,6]
array_2d=xlnet_mask(input_list)
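To check the masking rule against the docstring example without randomness, the same logic can be applied to a fixed permutation (this helper is our addition, not part of the original code):

```python
import numpy as np

def xlnet_mask_fixed(order):
    # Same masking rule as xlnet_mask above, but with a caller-supplied
    # permutation instead of random.shuffle, so the output is deterministic.
    n = len(order)
    mask = np.zeros((n, n))
    for index, current in enumerate(order):
        for previous in order[:index]:  # tokens before `current` in the order
            mask[current - 1][previous - 1] = 1
    return mask

# The order [x3, x2, x4, x1] from the docstring reproduces its example matrix:
# rows x1..x4 = [[0,1,1,1], [0,0,1,0], [0,0,0,0], [0,1,1,0]]
print(xlnet_mask_fixed([3, 2, 4, 1]))
```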

Performance

Please report your results and add them here.

Any dataset or task is welcome, including XNLI, LCQMC, the reading-comprehension dataset CMRC, CCF-Sentiment-Analysis, and so on.

Loading the model (using LCQMC, a Sentence Pair Matching task, as an example)

Pre-training

1. Generate tfrecords:

SAVE_DIR=gs://xlnet_zh/tf_records_xlnet
INPUT=gs://raw_text/data_2019_raw/*.txt 
nohup python -u data_utils.py \
    --bsz_per_host=32 \
    --num_core_per_host=8 \
    --seq_len=512 \
    --reuse_len=256 \
    --input_glob=${INPUT} \
    --save_dir=${SAVE_DIR} \
    --num_passes=20 \
    --bi_data=True \
    --sp_path=spiece.model \
    --mask_alpha=6 \
    --mask_beta=1 \
    --num_predict=85 \
    --uncased=False \
    --num_task=200 \
    --task=1 &
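Some of the flags above appear to be related. In particular, num_predict=85 seems to follow from seq_len and the masking ratio mask_beta/mask_alpha = 1/6; this is our reading of the flags, not something documented by this project:

```python
# Relation between the data_utils.py flags above (our interpretation):
# roughly mask_beta out of every mask_alpha tokens become prediction
# targets, so num_predict ~= seq_len * mask_beta / mask_alpha.
seq_len = 512
mask_alpha = 6
mask_beta = 1

num_predict = seq_len * mask_beta // mask_alpha
print(num_predict)  # 85, matching --num_predict=85
```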

This first step assumes you already have a vocabulary (this project's vocabulary is at src/spiece.model). If you need to build your own vocabulary, see below; for more information, see: SentencePiece

Generate a vocabulary: spm_train
--input=gs://raw_text/data_2019_raw/*.txt
--model_prefix=sp10m.cased.v3
--vocab_size=32000
--character_coverage=0.99995
--model_type=unigram
--control_symbols=<cls>,<sep>,<pad>,<mask>,<eod>
--user_defined_symbols=<eop>,.,(,),",-,–,£,€
--shuffle_input_sentence
--input_sentence_size=200000000

2. Train the model:

DATA=gs://xlnet_zh/tf_records_xlnet/tfrecords/
MODEL_DIR=gs://xlnet_zh/xlnet_zh_large
TPU_NAME=xlnet-zh-large-v3-256 
TPU_ZONE=europe-west4-a
nohup python train.py \
    --record_info_dir=$DATA \
    --model_dir=$MODEL_DIR \
    --train_batch_size=512 \
    --num_hosts=32 \
    --num_core_per_host=8 \
    --seq_len=512 \
    --reuse_len=256 \
    --mem_len=384 \
    --perm_size=256 \
    --n_layer=24 \
    --d_model=1024 \
    --d_embed=1024 \
    --n_head=16 \
    --d_head=64 \
    --d_inner=4096 \
    --untie_r=True \
    --mask_alpha=6 \
    --mask_beta=1 \
    --num_predict=85 \
    --uncased=False \
    --train_steps=200000 \
    --save_steps=3000 \
    --warmup_steps=10000 \
    --max_save=30 \
    --weight_decay=0.01 \
    --adam_epsilon=1e-6 \
    --learning_rate=1e-5 \
    --dropout=0.1 \
    --dropatt=0.1 \
    --tpu=$TPU_NAME \
    --tpu_zone=$TPU_ZONE \
    --use_tpu=True \
    --track_mean=True &

Fine-tuning (using the LCQMC task as an example)

XLNET_DIR=gs://xlnet_zh/xlnet_zh_large
MODEL_DIR=gs://xlnet_zh/fine_tuning_test/lcqmc_01
DATA_DIR=gs://xlnet_zh/fine_tuning_test/lcqmc_01/lcqmc_tfrecords
RAW_DIR=gs://roberta_zh/compare_model_performance/lcqmc
TPU_NAME=grpc://03.06.08.09:8470
TPU_ZONE=us-central1-a
nohup python -u run_classifier.py \
    --spiece_model_file=./spiece.model \
    --model_config_path=${XLNET_DIR}/config.json \
    --init_checkpoint=${XLNET_DIR}/model.ckpt-192000 \
    --task_name=lcqmc \
    --do_train=True \
    --do_eval=True \
    --eval_all_ckpt=True \
    --uncased=False \
    --data_dir=${RAW_DIR} \
    --output_dir=${DATA_DIR} \
    --model_dir=${MODEL_DIR} \
    --train_batch_size=128 \
    --eval_batch_size=8 \
    --num_hosts=1 \
    --num_core_per_host=8 \
    --num_train_epochs=3 \
    --max_seq_length=128 \
    --learning_rate=2e-5 \
    --save_steps=1000 \
    --use_tpu=True \
    --tpu=${TPU_NAME} \
    --tpu_zone=${TPU_ZONE} >> xlnet_large_lcqmc_1.out &

Note: the TPU_NAME above is a dummy value; replace the IP with your TPU's real address.

Learning Curve

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)

Reference

[1] XLNet: Generalized Autoregressive Pretraining for Language Understanding

[2] Chinese-PreTrained-XLNet

[3] XLNet:运行机制及和Bert的异同比较 (XLNet: how it works and how it differs from BERT)

Contributors

brightmart

Issues

Unsuccessful TensorSliceReader constructor

Pre-training script:

python $SCRIPT_DIR/run_classifier.py
--spiece_model_file=${XLNET_DIR}/spiece.model
--model_config_path=${XLNET_DIR}/xlnet_config.json
--init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \

Problem encountered when loading the model:
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /data/user0/test/classifier/bin/xlnet/data//xlnet_model.ckpt
Traceback (most recent call last):
  File "/data/user0/test/classifier/bin/xlnet//run_classifier.py", line 1103, in <module>
    tf.app.run()
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/data/user0/test/classifier/bin/xlnet//run_classifier.py", line 971, in main
    estimator.train(input_fn=train_input_fn, max_steps=train_steps)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1122, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1185, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1254, in _actual_train_model_distributed
    self.config))
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1199, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 641, in _call_for_each_replica
    return _call_for_each_replica(self._container_strategy(), fn, args, kwargs)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 189, in _call_for_each_replica
    coord.join(threads)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 852, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/data/user0/test/classifier/bin/xlnet//run_classifier.py", line 750, in model_fn
    scaffold_fn = model_utils.init_from_checkpoint(FLAGS)
  File "/data/user0/test/classifier/bin/xlnet/model_utils.py", line 88, in init_from_checkpoint
    ) = get_assignment_map_from_checkpoint(tvars, init_checkpoint)
  File "/data/user0/test/classifier/bin/xlnet/model_utils.py", line 293, in get_assignment_map_from_checkpoint
    init_vars = tf.train.list_variables(init_checkpoint)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 95, in list_variables
    reader = load_checkpoint(ckpt_dir_or_file)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 64, in load_checkpoint
    return pywrap_tensorflow.NewCheckpointReader(filename)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 326, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /data/user0/test/classifier/bin/xlnet/data//xlnet_model.ckpt

PyTorch version

Hello, when I use the command you provided to convert to the PyTorch version, I get an error. What could be the cause?

RuntimeError: Sizes must be non-negative. Presumably caused by vocab_size=-1?
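A possible workaround, assuming the error really does come from a vocab_size=-1 placeholder in the released config.json: patch the config before running the conversion script. The helper name and the value 32000 (taken from the spm_train command above) are our assumptions, not confirmed by the maintainer:

```python
import json

def patch_vocab_size(config_path, vocab_size=32000):
    """Replace a missing or negative vocab_size placeholder in a config.json."""
    with open(config_path) as f:
        config = json.load(f)
    vs = config.get("vocab_size")
    if vs is None or vs < 0:
        config["vocab_size"] = vocab_size  # assumed value; match your spiece.model
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return config

# Demo on a stand-in config file carrying the placeholder value:
with open("demo_config.json", "w") as f:
    json.dump({"vocab_size": -1, "n_layer": 24}, f)
print(patch_vocab_size("demo_config.json")["vocab_size"])  # 32000
```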

Chinese MRC tasks

I'd like to ask how well this performs on Chinese MRC (machine reading comprehension) tasks?

PyTorch model

Hello, could you please provide a PyTorch version? Converting with pytorch transformers currently fails. Later we can report back the accuracy of our XLNet-based sentiment-classification model (CCF BDCI), thanks.

Google Drive

Hello, could you upload the model to Google Drive? The Google Drive link is still empty. I can't download from Baidu Netdisk at my company, thanks.

Parameter inconsistency: pre-training uses reuse_len=256, but config.json in the released files has reuse_len=null

Hello,

I found two parameters whose values differ between the training configuration and the released pre-trained files.

In step "1. Generate tfrecords" of the pre-training instructions at https://github.com/brightmart/xlnet_zh: --reuse_len=256 \

In step "2. Train the model": --mem_len=384 \

But in the config.json of the downloaded pre-trained files (this holds for both the 12-layer small model and the 24-layer large model):

  "mem_len": null,
  "reuse_len": null,

What causes this? Could it make XLNet degrade to BERT-like behavior at prediction time?
