Alternate training for the conditional hidden Markov model and BERT-NER.
This code accompanies the paper *BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition*.
To view the previous version of the program used for the paper, switch to the `prev` branch.
The conditional hidden Markov model (CHMM) is also included in the Wrench project.
Please check `requirement.txt` for the package dependencies.
The data construction program may require the specified versions of spaCy and AllenNLP.
The model training program should be compatible with any package versions.
Note: this repo contains submodules.
Cloning it does not automatically fetch the files in the submodule folders.
To get those files, run `git submodule update --init` (ref).
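For reference, either of the following standard git workflows pulls in the submodule files; `<REPO URL>` and `<REPO DIR>` are placeholders for this repo's URL and folder name:

```bash
# clone the repo and its submodules in a single step
git clone --recurse-submodules <REPO URL>

# or, if the repo is already cloned, fetch the submodule files afterwards
cd <REPO DIR>
git submodule update --init
```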
The dataset construction programs for the NCBI-Disease, BC5CDR, and LaptopReview datasets are modified from the wiser project (paper), which spans three repos.
The dataset construction program for the CoNLL 2003 dataset is based on skweak.
The source data are provided in the folders `DataConstr/<DATASET NAME>/data`.
You can also download them from the links below:
- BC5CDR: download the train, development, and test BioCreative V CDR corpus data files.
- NCBI Disease: download the complete training, development, and testing sets.
- LaptopReview: download the train data V2.0 for the Laptops and Restaurants dataset and the test data (phase B).
- CoNLL 2003: a pre-processed CoNLL 2003 English dataset is available here.
Place the downloaded data in the corresponding folders `DataConstr/<DATASET NAME>/data`, as in the sketch below.
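For example, placing the BC5CDR files might look like this; the file names here are hypothetical and depend on what the download actually contains:

```bash
# move the downloaded BC5CDR corpus files into place
# (hypothetical file names; use the names from the actual download)
mv ~/Downloads/CDR_TrainingSet.BioC.xml    DataConstr/BC5CDR/data/
mv ~/Downloads/CDR_DevelopmentSet.BioC.xml DataConstr/BC5CDR/data/
mv ~/Downloads/CDR_TestSet.BioC.xml        DataConstr/BC5CDR/data/
```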
To build the CoNLL 2003 dataset, you may need the external dictionaries and models on which skweak depends.
You can get these files from here.
Unzip them and place the contents in `DataConstr/Dependency/`.
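A minimal sketch of that step, assuming the download is a single zip archive (the archive name below is hypothetical):

```bash
# unzip the skweak dependency archive and move its contents into place
unzip skweak_dependencies.zip -d skweak_dependencies  # hypothetical archive name
mv skweak_dependencies/* DataConstr/Dependency/
```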
Run the `build.sh` script in the dataset folder `DataConstr/<DATASET NAME>` with `./build.sh`.
If the program runs successfully, you will see `train.json`, `valid.json`, `test.json`, and `meta.json` in your target folder.
You can also customize the script with your preferred arguments.
A full walk-through of this step is sketched below.
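Putting it together for one dataset (NCBI-Disease is used here purely as an illustrative folder name):

```bash
cd DataConstr/NCBI-Disease   # any dataset folder works the same way
./build.sh
ls   # on success, the target folder contains train.json, valid.json, test.json, meta.json
```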
Notice: the datasets constructed as described above are not exactly the same as the datasets used in the paper.
However, our code fully supports the previous version of the datasets.
To reproduce the results in the paper, please refer to the dataset construction methods in the `prev` branch and point the file location arguments to their directories.
Our program uses the argument-parsing techniques from the Hugging Face transformers repo.
They support the ordinary argument-parsing approach from shell inputs as well as parsing from JSON files.
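Both invocation styles below should therefore work; the flag names are hypothetical, so check the demo configuration for the actual argument names:

```bash
# style 1: read all arguments from a JSON file
python chmm_train.py config.json

# style 2: pass ordinary shell arguments (hypothetical flag names)
python chmm_train.py --train_file ../data/train.json --learning_rate 1e-4
```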
Conditional hidden Markov model
To train and evaluate CHMM, go to `./LabelModel/` and run `python chmm_train.py config.json`.
Here `config.json` is just a demo configuration; you will need to tune the hyper-parameters to get better performance.
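As a rough illustration, a run with a custom configuration might look like the sketch below; every key name in the JSON is hypothetical, so consult the demo `config.json` for the actual schema:

```bash
# write a hypothetical configuration and launch training with it
cat > my_config.json << 'EOF'
{
  "train_file": "../DataConstr/CoNLL2003/train.json",
  "valid_file": "../DataConstr/CoNLL2003/valid.json",
  "num_train_epochs": 20,
  "learning_rate": 1e-3
}
EOF
python chmm_train.py my_config.json
```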
You can train a fully supervised BERT-NER model with ground-truth labels by going to the `./EndModel/` folder and running `python bert_train.py config.json`.
The file `./ALT/chmm-alt.py` implements the alternate-training technique introduced in the paper.
You can train a CHMM and a BERT-NER model alternately with `./chmm-alt.sh` or `python chmm-alt.py config.json`.
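For completeness, both launch paths from the repo root (assuming both scripts live in `./ALT/`):

```bash
cd ALT
./chmm-alt.sh                   # shell-script entry point
# or, equivalently
python chmm-alt.py config.json  # direct Python entry point
```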
If you find our work helpful, please consider citing it as:
@inproceedings{li-etal-2021-bertifying,
title = "{BERT}ifying the Hidden {M}arkov Model for Multi-Source Weakly Supervised Named Entity Recognition",
author = "Li, Yinghao and
Shetty, Pranav and
Liu, Lucas and
Zhang, Chao and
Song, Le",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.482",
doi = "10.18653/v1/2021.acl-long.482",
pages = "6178--6190",
}