
BERT-NER

This project implements a solution to the "X" label issue (e.g., #148, #422) that arises when applying Google's BERT to the NER task. It is developed largely based on lemonhu's work and bheinzerling's suggestion.

Dataset

Requirements

This repo was tested on Python 3.6+ and PyTorch 1.3.1. The main requirements are:

  • nltk
  • tqdm
  • pytorch >= 1.3.1
  • 🤗transformers == 2.2.2
  • tensorflow == 1.11.0 (Optional)

Note: The tensorflow library is only used for the conversion of pretrained models from TensorFlow to PyTorch.
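For reference, the dependencies above can be installed in one step (version pins as listed; adjust to your environment):

    pip install nltk tqdm "torch>=1.3.1" transformers==2.2.2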

Quick Start

  • Download and unzip the Chinese (or English) NER model weights under experiments/msra/ (or experiments/conll/), then run:

    python build_dataset_tags.py --dataset=msra
    python interactive.py --dataset=msra

    to try it out and interact with the pretrained NER model.

Usage

  1. Get BERT model for PyTorch

    There are two ways to get the pretrained BERT model in a PyTorch dump for your experiments:

    • [Automatically] Download the specified pretrained BERT model provided by huggingface🤗 (see the loading sketch at the end of this step)

    • [Manually] Convert the TensorFlow checkpoint to a PyTorch dump

      • Download Google's pretrained BERT models for Chinese (BERT-Base, Chinese) and English (BERT-Base, Cased). Then decompress them under pretrained_bert_models/bert-chinese-cased/ and pretrained_bert_models/bert-base-cased/ respectively. More pre-trained models are available here.

      • Execute the following command to convert the TensorFlow checkpoint to a PyTorch dump, as huggingface suggests. Here is an example of the conversion process for a pretrained BERT-Base, Cased model.

        export TF_BERT_MODEL_DIR=/full/path/to/cased_L-12_H-768_A-12
        export PT_BERT_MODEL_DIR=/full/path/to/pretrained_bert_models/bert-base-cased
         
        transformers bert \
          $TF_BERT_MODEL_DIR/bert_model.ckpt \
          $TF_BERT_MODEL_DIR/bert_config.json \
          $PT_BERT_MODEL_DIR/pytorch_model.bin
      • Copy the BERT configuration file bert_config.json (renamed to config.json, which the transformers library expects) and the vocabulary file vocab.txt into the directory $PT_BERT_MODEL_DIR.

        cp $TF_BERT_MODEL_DIR/bert_config.json $PT_BERT_MODEL_DIR/config.json
        cp $TF_BERT_MODEL_DIR/vocab.txt $PT_BERT_MODEL_DIR/vocab.txt
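
    For the [Automatically] route, downloading and caching the pretrained weights is a one-liner with the transformers library. A minimal sketch (bert-base-cased is just an example; the repo picks the model matching the dataset's language):

        from transformers import BertModel, BertTokenizer

        # The first call downloads the weights and caches them locally.
        tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
        model = BertModel.from_pretrained("bert-base-cased")

    Note that on more recent versions of transformers, the converter is exposed as transformers-cli convert instead of the transformers bert command above; in that case the equivalent invocation (per the huggingface documentation) is:

        transformers-cli convert --model_type bert \
          --tf_checkpoint $TF_BERT_MODEL_DIR/bert_model.ckpt \
          --config $TF_BERT_MODEL_DIR/bert_config.json \
          --pytorch_dump_output $PT_BERT_MODEL_DIR/pytorch_model.bin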
        
  2. Build dataset and tags

    If you use the default parameters (the CoNLL-2003 dataset is used by default), just run

    python build_dataset_tags.py

    Or specify the dataset (e.g., MSRA) and other parameters on the command line:

    python build_dataset_tags.py --dataset=msra

    It will extract the sentences and tags from train_bio, test_bio, and val_bio (if val_bio is not provided, 5% of the data is randomly sampled from train_bio to create it). It then splits them into train/val/test sets, saves them in a format convenient for our model, and creates a file tags.txt containing the collection of tags.
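
    The fallback 5% validation split amounts to something like the following sketch (illustrative only; paths and the script's actual I/O may differ):

        import random

        # BIO files keep one token and tag per line, with blank lines
        # separating sentences, so sentences are recovered by splitting
        # on blank lines.
        with open("data/msra/train_bio", encoding="utf-8") as f:
            sentences = f.read().strip().split("\n\n")

        random.shuffle(sentences)
        n_val = max(1, int(0.05 * len(sentences)))  # hold out 5%
        val_bio, train_bio = sentences[:n_val], sentences[n_val:]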

  3. Set experimental hyperparameters

    Under the experiments directory we created one directory per dataset. Each contains a file params.json, which sets the hyperparameters for that experiment. It looks like:

    {
        "full_finetuning": true,
        "max_len": 180,
        "learning_rate": 5e-5,
        "weight_decay": 0.01,
        "clip_grad": 5
    }

    For a different dataset, you will need to create a new directory under experiments with its own params.json.
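
    Reading these hyperparameters back is plain JSON (a minimal sketch; the repo's utilities may wrap this differently):

        import json

        # Load the experiment's hyperparameters shown above.
        with open("experiments/conll/params.json") as f:
            params = json.load(f)

        print(params["learning_rate"])  # 5e-05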

  4. Train and evaluate the model

    If you use the default parameters (the CoNLL-2003 dataset is used by default), just run

    python train.py

    Or specify the dataset (e.g., MSRA) and other parameters on the command line:

    python train.py --dataset=msra

    A proper pretrained BERT model will be automatically chosen according to the language of the specified dataset. The script instantiates a model, trains it on the training set following the hyperparameters specified in params.json, and evaluates some metrics on the development set.
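
    For intuition, here is how the params.json values above typically map onto a standard BERT fine-tuning setup (a sketch under common conventions, not the repo's exact train.py):

        import torch
        from transformers import BertForTokenClassification

        # num_labels comes from tags.txt in practice; 9 is a placeholder.
        model = BertForTokenClassification.from_pretrained(
            "bert-base-cased", num_labels=9)

        # Apply weight decay to all parameters except biases and LayerNorm
        # weights, the usual BERT fine-tuning convention.
        no_decay = ["bias", "LayerNorm.weight"]
        grouped_params = [
            {"params": [p for n, p in model.named_parameters()
                        if not any(nd in n for nd in no_decay)],
             "weight_decay": 0.01},  # "weight_decay" in params.json
            {"params": [p for n, p in model.named_parameters()
                        if any(nd in n for nd in no_decay)],
             "weight_decay": 0.0},
        ]
        optimizer = torch.optim.AdamW(grouped_params, lr=5e-5)  # "learning_rate"

        # Inside the training loop, after loss.backward():
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)  # "clip_grad"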

  5. Evaluation on the test set

    Once you've run many experiments and selected your best model and hyperparameters based on the performance on the development set, you can finally evaluate the performance of your model on the test set.

    If you use the default parameters (the CoNLL-2003 dataset is used by default), just run

    python evaluate.py

    Or specify the dataset (e.g., MSRA) and other parameters on the command line:

    python evaluate.py --dataset=msra

Issues

Sequence Labeling Issue

How should a word be labeled when the BERT vocab does not contain it?
This situation happens in English corpora. For example, the word 'jony' is tokenized into 'jon', '##y'. If 'jony' is labeled B-PER in the original corpus, how are the corresponding labels for the sub-tokens produced by BERT handled in your code?
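
One common fix, and roughly what the "X" label solution amounts to (a sketch, not necessarily this repo's exact implementation): give the word's label to its first sub-token and mask the remaining pieces out of the loss and the evaluation.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

    words, word_labels = ["jony", "works"], ["B-PER", "O"]
    tokens, labels, loss_mask = [], [], []
    for word, label in zip(words, word_labels):
        pieces = tokenizer.tokenize(word)  # 'jony' -> ['jon', '##y']
        tokens.extend(pieces)
        # Only the first piece carries the real label and enters the loss.
        labels.extend([label] + ["O"] * (len(pieces) - 1))
        loss_mask.extend([1] + [0] * (len(pieces) - 1))

    print(tokens)     # ['jon', '##y', 'works']
    print(labels)     # ['B-PER', 'O', 'O']
    print(loss_mask)  # [1, 0, 0]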
