
BERT-NER

This project implements a solution to the "X" label issue (e.g., #148, #422) that arises when applying Google's BERT to the NER task. It is developed largely based on lemonhu's work and bheinzerling's suggestion.

Dataset

Requirements

This repo was tested on Python 3.6+ and PyTorch 1.3.1. The main requirements are:

  • nltk
  • tqdm
  • pytorch >= 1.3.1
  • 🤗transformers == 2.2.2
  • tensorflow == 1.11.0 (Optional)

Note: The tensorflow library is only used for the conversion of pretrained models from TensorFlow to PyTorch.
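For reference, the dependencies above can be installed in one step (version pins as listed; adjust to your environment):

    pip install nltk tqdm "torch>=1.3.1" transformers==2.2.2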

Quick Start

  • Download and unzip the Chinese (or English) NER model weights under experiments/msra/ (or experiments/conll/), then run:

    python build_dataset_tags.py --dataset=msra
    python interactive.py --dataset=msra

    to try it out and interact with the pretrained NER model.

Usage

  1. Get BERT model for PyTorch

    There are two ways to get the pretrained BERT model in a PyTorch dump for your experiments:

    • [Automatically] Download the specified pretrained BERT model provided by huggingface🤗 (see the loading sketch at the end of this step)

    • [Manually] Convert the TensorFlow checkpoint to a PyTorch dump

      • Download Google's pretrained BERT models for Chinese (BERT-Base, Chinese) and English (BERT-Base, Cased). Then decompress them under pretrained_bert_models/bert-chinese-cased/ and pretrained_bert_models/bert-base-cased/ respectively. More pre-trained models are available here.

      • Execute the following command to convert the TensorFlow checkpoint to a PyTorch dump, as huggingface suggests. Here is an example of the conversion process for a pretrained BERT-Base, Cased model.

        export TF_BERT_MODEL_DIR=/full/path/to/cased_L-12_H-768_A-12
        export PT_BERT_MODEL_DIR=/full/path/to/pretrained_bert_models/bert-base-cased
         
        transformers bert \
          $TF_BERT_MODEL_DIR/bert_model.ckpt \
          $TF_BERT_MODEL_DIR/bert_config.json \
          $PT_BERT_MODEL_DIR/pytorch_model.bin
      • Copy the BERT configuration file bert_config.json (renamed to config.json, which the transformers library expects) and the vocabulary file vocab.txt into the directory $PT_BERT_MODEL_DIR.

        cp $TF_BERT_MODEL_DIR/bert_config.json $PT_BERT_MODEL_DIR/config.json
        cp $TF_BERT_MODEL_DIR/vocab.txt $PT_BERT_MODEL_DIR/vocab.txt
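
    For the [Automatically] route, downloading and caching the pretrained weights is a one-liner with the transformers library. A minimal sketch (bert-base-cased is just an example; the repo picks the model matching the dataset's language):

        from transformers import BertModel, BertTokenizer

        # The first call downloads the weights and caches them locally.
        tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
        model = BertModel.from_pretrained("bert-base-cased")

    Note that on more recent versions of transformers, the converter is exposed as transformers-cli convert instead of the transformers bert command above; in that case the equivalent invocation (per the huggingface documentation) is:

        transformers-cli convert --model_type bert \
          --tf_checkpoint $TF_BERT_MODEL_DIR/bert_model.ckpt \
          --config $TF_BERT_MODEL_DIR/bert_config.json \
          --pytorch_dump_output $PT_BERT_MODEL_DIR/pytorch_model.bin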
        
  2. Build dataset and tags

    If you use the default parameters (the CoNLL-2003 dataset is used by default), just run

    python build_dataset_tags.py

    Or specify the dataset (e.g., MSRA) and other parameters on the command line:

    python build_dataset_tags.py --dataset=msra

    It will extract the sentences and tags from train_bio, test_bio, and val_bio (if val_bio is not provided, 5% of the data is randomly sampled from train_bio to create it). It then splits them into train/val/test sets, saves them in a format convenient for our model, and creates a file tags.txt containing the collection of tags.
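
    The fallback 5% validation split amounts to something like the following sketch (illustrative only; paths and the script's actual I/O may differ):

        import random

        # BIO files keep one token and tag per line, with blank lines
        # separating sentences, so sentences are recovered by splitting
        # on blank lines.
        with open("data/msra/train_bio", encoding="utf-8") as f:
            sentences = f.read().strip().split("\n\n")

        random.shuffle(sentences)
        n_val = max(1, int(0.05 * len(sentences)))  # hold out 5%
        val_bio, train_bio = sentences[:n_val], sentences[n_val:]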

  3. Set experimental hyperparameters

    Under the experiments directory we created one directory per dataset. Each contains a file params.json, which sets the hyperparameters for that experiment. It looks like:

    {
        "full_finetuning": true,
        "max_len": 180,
        "learning_rate": 5e-5,
        "weight_decay": 0.01,
        "clip_grad": 5
    }

    For a different dataset, you will need to create a new directory under experiments with its own params.json.
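
    Reading these hyperparameters back is plain JSON (a minimal sketch; the repo's utilities may wrap this differently):

        import json

        # Load the experiment's hyperparameters shown above.
        with open("experiments/conll/params.json") as f:
            params = json.load(f)

        print(params["learning_rate"])  # 5e-05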

  4. Train and evaluate the model

    If you use the default parameters (the CoNLL-2003 dataset is used by default), just run

    python train.py

    Or specify the dataset (e.g., MSRA) and other parameters on the command line:

    python train.py --dataset=msra

    A proper pretrained BERT model will be automatically chosen according to the language of the specified dataset. The script instantiates a model, trains it on the training set following the hyperparameters specified in params.json, and evaluates some metrics on the development set.
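
    For intuition, here is how the params.json values above typically map onto a standard BERT fine-tuning setup (a sketch under common conventions, not the repo's exact train.py):

        import torch
        from transformers import BertForTokenClassification

        # num_labels comes from tags.txt in practice; 9 is a placeholder.
        model = BertForTokenClassification.from_pretrained(
            "bert-base-cased", num_labels=9)

        # Apply weight decay to all parameters except biases and LayerNorm
        # weights, the usual BERT fine-tuning convention.
        no_decay = ["bias", "LayerNorm.weight"]
        grouped_params = [
            {"params": [p for n, p in model.named_parameters()
                        if not any(nd in n for nd in no_decay)],
             "weight_decay": 0.01},  # "weight_decay" in params.json
            {"params": [p for n, p in model.named_parameters()
                        if any(nd in n for nd in no_decay)],
             "weight_decay": 0.0},
        ]
        optimizer = torch.optim.AdamW(grouped_params, lr=5e-5)  # "learning_rate"

        # Inside the training loop, after loss.backward():
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)  # "clip_grad"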

  5. Evaluation on the test set

    Once you've run many experiments and selected your best model and hyperparameters based on the performance on the development set, you can finally evaluate the performance of your model on the test set.

    If you use the default parameters (the CoNLL-2003 dataset is used by default), just run

    python evaluate.py

    Or specify the dataset (e.g., MSRA) and other parameters on the command line:

    python evaluate.py --dataset=msra

Issues

Sequence Labeling Issue

How should a word be labeled when the BERT vocab does not contain it?
This situation happens in English corpora. For example, the word 'jony' is tokenized into 'jon', '##y'. If 'jony' is labeled B-PER in the original corpus, how are the corresponding labels for the sub-tokens produced by BERT handled in your code?
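
One common fix, and roughly what the "X" label solution amounts to (a sketch, not necessarily this repo's exact implementation): give the word's label to its first sub-token and mask the remaining pieces out of the loss and the evaluation.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

    words, word_labels = ["jony", "works"], ["B-PER", "O"]
    tokens, labels, loss_mask = [], [], []
    for word, label in zip(words, word_labels):
        pieces = tokenizer.tokenize(word)  # 'jony' -> ['jon', '##y']
        tokens.extend(pieces)
        # Only the first piece carries the real label and enters the loss.
        labels.extend([label] + ["O"] * (len(pieces) - 1))
        loss_mask.extend([1] + [0] * (len(pieces) - 1))

    print(tokens)     # ['jon', '##y', 'works']
    print(labels)     # ['B-PER', 'O', 'O']
    print(loss_mask)  # [1, 0, 0]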
