superbrucejia / medicalner

An implementation of several models (BiLSTM-CRF, BiLSTM-CNN, BiLSTM-BiLSTM) for Medical Named Entity Recognition (NER)

License: MIT License

Medical Named Entity Recognition (MedicalNER)

Abstract

With the development of medical Artificial Intelligence (AI) systems, Natural Language Processing (NLP) has played an essential role in processing medical texts and building intelligent machines. Named Entity Recognition (NER), one of the most fundamental NLP tasks, is widely studied because it is the cornerstone of downstream NLP tasks such as Relation Extraction. In this work, character-level Bidirectional Long Short-Term Memory (BiLSTM)-based models are introduced to tackle the challenges of medical texts. The input character embedding vectors are randomly initialized and then updated during training. The character-level BiLSTM extracts features from the order-sensitive sequential medical data, and a Conditional Random Field (CRF) on top of it predicts the final entity tags. Results show that the presented method takes advantage of the recurrent architecture and achieves competitive performance on medical texts. These promising results pave the road towards building robust and powerful medical AI engines.


Topic and Study

Task: Named Entity Recognition (NER) implemented using PyTorch

Background: Medical & Clinical Healthcare

Level: Character (and Word) Level

Data Annotation: BIOES tagging Scheme
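As an illustration of the BIOES scheme (the entity label and text below are made up, not from the dataset), a small helper shows how one mention is tagged: single-character mentions get S-, longer ones B- … I- … E-, and characters outside any entity are tagged O:

```python
def bioes_tags(entity_chars, label):
    """Tag one entity mention under the BIOES scheme:
    S- for a single character, otherwise B- (begin), I- (inside), E- (end)."""
    if len(entity_chars) == 1:
        return [f"S-{label}"]
    return ([f"B-{label}"]
            + [f"I-{label}"] * (len(entity_chars) - 2)
            + [f"E-{label}"])

# "pain" as a hypothetical SYMPTOM mention.
print(bioes_tags(list("pain"), "SYMPTOM"))
# → ['B-SYMPTOM', 'I-SYMPTOM', 'I-SYMPTOM', 'E-SYMPTOM']
```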

Method:

  1. CRF++

  2. Character-level BiLSTM + CRF

  3. Character-level BiLSTM + Word-level BiLSTM + CRF

  4. Character-level BiLSTM + Word-level CNN + CRF
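Variants 3 and 4 combine two feature streams per token. A minimal sketch of variant 4 (character-level BiLSTM features concatenated with word-level CNN features); all sizes are illustrative assumptions, and the character and word sequences are assumed to be pre-aligned to the same length, which glosses over the segmentation step:

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Sketch of variant 4: char-level BiLSTM features concatenated with
    word-level CNN features (all sizes are illustrative assumptions)."""
    def __init__(self, char_vocab=2258, word_vocab=20000,
                 char_dim=30, word_dim=100, hidden=128):
        super().__init__()
        self.char_embed = nn.Embedding(char_vocab, char_dim)
        self.char_lstm = nn.LSTM(char_dim, hidden, bidirectional=True,
                                 batch_first=True)
        self.word_embed = nn.Embedding(word_vocab, word_dim)
        # 1-D convolution over the word sequence extracts n-gram features
        self.word_cnn = nn.Conv1d(word_dim, hidden, kernel_size=3, padding=1)

    def forward(self, char_ids, word_ids):
        char_feat, _ = self.char_lstm(self.char_embed(char_ids))   # (B, T, 2H)
        word_feat = self.word_cnn(
            self.word_embed(word_ids).transpose(1, 2)).transpose(1, 2)  # (B, T, H)
        return torch.cat([char_feat, word_feat], dim=-1)           # (B, T, 3H)
```

A CRF layer would then score tag sequences over the concatenated features.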

Results:

 Results of this work can be downloaded here.

Prerequisites:

 For Word-level models:

 The pre-trained word vectors can be downloaded here.

import codecs

import numpy as np

def load_word_vector(self):
    """Load pre-trained word vectors from a .vec text file."""
    print("Loading pre-trained word vectors ...")
    pre_trained = {}
    for line in codecs.open(self.model_path + "word_vectors.vec", 'r', encoding='utf-8'):
        line = line.rstrip().split()
        # Keep only well-formed lines: one token followed by word_dim floats
        if len(line) == self.word_dim + 1:
            pre_trained[line[0]] = np.array([float(x) for x in line[1:]], dtype=np.float32)
    return pre_trained
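Once loaded, the vectors are typically copied into an embedding matrix indexed by the word vocabulary, with small random fallbacks for uncovered words. The helper below is an illustrative pattern, not the repo's exact code:

```python
import numpy as np

def build_embedding_matrix(pre_trained, word_to_id, word_dim):
    """Fill an embedding matrix from pre-trained vectors; out-of-vocabulary
    words get small random vectors (a common fallback)."""
    scale = np.sqrt(3.0 / word_dim)
    matrix = np.random.uniform(-scale, scale,
                               (len(word_to_id), word_dim)).astype(np.float32)
    hits = 0
    for word, idx in word_to_id.items():
        if word in pre_trained:
            matrix[idx] = pre_trained[word]
            hits += 1
    print(f"{hits}/{len(word_to_id)} words covered by pre-trained vectors")
    return matrix
```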

 For Character-level models:

 Character embeddings are randomly initialized and updated during training via PyTorch's nn.Embedding module.

self.char_embed = nn.Embedding(num_embeddings=vocab_size, embedding_dim=self.char_dim)
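A quick check of what this lookup produces, using the character vocabulary size (2258) from the dataset statistics below and an illustrative embedding dimension of 30:

```python
import torch
import torch.nn as nn

char_embed = nn.Embedding(num_embeddings=2258, embedding_dim=30)
ids = torch.tensor([[5, 17, 42]])   # one sentence as three character indices
vectors = char_embed(ids)           # shape: (1, 3, 30)
assert vectors.requires_grad        # the lookup table is trained by backprop
```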

Some Statistics Info

Number of entities: 34

| No. | Entity | Number | Recognized |
|----:|--------|-------:|:----------:|
| 1 | E95f2a617 | 3221 | ✔︎ |
| 2 | E320ca3f6 | 6338 | ✔︎ |
| 3 | E340ca71c | 22209 | ✔︎ |
| 4 | E1ceb2bd7 | 3706 | ✔︎ |
| 5 | E1deb2d6a | 9744 | ✔︎ |
| 6 | E370cabd5 | 6196 | ✔︎ |
| 7 | E360caa42 | 5268 | ✔︎ |
| 8 | E310ca263 | 6948 | ✔︎ |
| 9 | E300ca0d0 | 9490 | ✔︎ |
| 10 | E18eb258b | 4526 | ✔︎ |
| 11 | E3c0cb3b4 | 6280 | ✔︎ |
| 12 | E1beb2a44 | 1663 | ✔︎ |
| 13 | E3d0cb547 | 1025 | ✔︎ |
| 14 | E14eb1f3f | 406 | |
| 15 | E8ff29ca5 | 1676 | ✔︎ |
| 16 | E330ca589 | 1487 | ✔︎ |
| 17 | E89f29333 | 1093 | |
| 18 | E8ef29b12 | 217 | |
| 19 | E1eeb2efd | 1637 | ✔︎ |
| 20 | E1aeb28b1 | 209 | |
| 21 | E17eb23f8 | 670 | ✔︎ |
| 22 | E87f05176 | 407 | ✔︎ |
| 23 | E88f05309 | 355 | ✔︎ |
| 24 | E19eb271e | 152 | |
| 25 | E8df2997f | 135 | |
| 26 | E94f2a484 | 584 | ✔︎ |
| 27 | E13eb1dac | 58 | |
| 28 | E85f04e50 | 6 | |
| 29 | E8bf057c2 | 7 | |
| 30 | E8cf297ec | 6 | |
| 31 | E8ff05e0e | 6 | ⨉︎ |
| 32 | E87e38583 | 18 | ⨉︎ |
| 33 | E86f04fe3 | 6 | ⨉︎ |
| 34 | E8cf05955 | 64 | ⨉︎ |

train data: 6494 sentences   vocab size: 2258   unique tags: 74

dev data: 865 sentences   vocab size: 2258   unique tags: 74

(data = number of sentences; vocab size = size of the character vocabulary; unique tags = number of prefix + entity combinations)


Structure of the code

At the root of the project, you will see:

├── data
|  └── train # Training set 
|  └── val # Validation set 
|  └── test # Testing set 
├── models
|  └── data.pkl # Containing all the used data, e.g., look-up table
|  └── params.pkl # Saved PyTorch model
├── preprocess-data.py # Preprocess the original dataset
├── data_manager.py # Load the train/val/test data
├── model.py # BiLSTM-CRF with Attention Model
├── main.py # Main codes for the training and prediction
├── utils.py # Some functions for prediction stage and evaluation criteria
├── config.yml # Contain the hyper-parameters settings

Basic Model Architecture

    Character Input
          |                         
     Lookup Layer  <----------------|    Update Character Embedding
          |                         |
     Bi-LSTM Model  <---------------|        Extract Features
          |                         |     Back-propagation Errors
     Linear Layer  <----------------|   Update Trainable Parameters
          |                         |
       CRF Model  <-----------------|    
          |                         |
Output corresponding tags  ---> [NLL Loss] <---  Target tags
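The pipeline in the diagram can be sketched in PyTorch as below; sizes are assumptions, and the CRF layer, which adds tag-transition scores and Viterbi decoding on top of the linear layer's emissions, is reduced to a comment:

```python
import torch
import torch.nn as nn

class BiLSTMEmitter(nn.Module):
    """Character input -> lookup layer -> BiLSTM -> linear emission scores.
    A CRF layer (not shown) would score tag transitions over these emissions
    and pick the best tag path with Viterbi decoding."""
    def __init__(self, vocab_size=2258, char_dim=30, hidden=128, num_tags=74):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, char_dim)   # lookup layer
        self.lstm = nn.LSTM(char_dim, hidden, bidirectional=True,
                            batch_first=True)             # feature extractor
        self.linear = nn.Linear(2 * hidden, num_tags)     # emission scores

    def forward(self, char_ids):
        feats, _ = self.lstm(self.embed(char_ids))
        return self.linear(feats)                         # (B, T, num_tags)
```

Training minimizes the CRF negative log-likelihood (the NLL loss in the diagram), which back-propagates through all trainable parameters including the embedding table.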

Limitations

  1. Currently only CPU training is supported

    GPU training is much slower than CPU training because Viterbi decoding runs as a Python for-loop over time steps, which cannot be parallelized on the GPU.

  2. Entities with too few examples (< 500 samples) cannot be recognized reliably
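The decoding bottleneck comes from Viterbi being inherently sequential over time steps. A pure-Python sketch (emission and transition shapes are assumptions) shows the loop in question:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (T, K) per-step tag scores; transitions: (K, K) scores for
    moving from tag i to tag j. Returns the best-scoring tag path.
    The `for t in range(...)` below is what serializes decoding and makes
    GPU execution slower than CPU for this step."""
    T, K = emissions.shape
    score = emissions[0].copy()
    backpointers = []
    for t in range(1, T):                  # sequential: cannot batch over time
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for bp in reversed(backpointers):      # trace back the best path
        best.append(int(bp[best[-1]]))
    return best[::-1]
```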


Final Results

Overall F1 score on 18 entities:

Separate F1 score on each entity:

| No. | Entity | Number | F1 Score |
|----:|--------|-------:|:--------:|
| 1 | E95f2a617 | 3221 | ✔︎ |
| 2 | E320ca3f6 | 6338 | ✔︎ |
| 3 | E340ca71c | 22209 | ✔︎ |
| 4 | E1ceb2bd7 | 3706 | ✔︎ |
| 5 | E1deb2d6a | 9744 | ✔︎ |
| 6 | E370cabd5 | 6196 | ✔︎ |
| 7 | E360caa42 | 5268 | ✔︎ |
| 8 | E310ca263 | 6948 | ✔︎ |
| 9 | E300ca0d0 | 9490 | ✔︎ |
| 10 | E18eb258b | 4526 | ✔︎ |
| 11 | E3c0cb3b4 | 6280 | ✔︎ |
| 12 | E1beb2a44 | 1663 | ✔︎ |
| 13 | E3d0cb547 | 1025 | ✔︎ |
| 14 | E8ff29ca5 | 1676 | ✔︎ |
| 15 | E330ca589 | 1487 | ✔︎ |
| 16 | E1eeb2efd | 1637 | ✔︎ |
| 17 | E17eb23f8 | 670 | ✔︎ |
| 18 | E94f2a484 | 584 | ✔︎ |

Hyperparameter settings

| Name | Value |
|------|-------|
| embedding_size | 30 / 40 / 50 / 100 |
| hidden_size | 128 / 256 |
| batch_size | 8 / 16 / 32 / 64 |
| dropout rate | 0.50 / 0.75 |
| learning rate | 0.01 / 0.001 |
| epochs | 100 |
| weight decay | 0.0005 |
| max length | 100 / 120 |
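These settings live in config.yml. A hedged sketch of how such a file might be parsed (the key names below mirror the table but are assumptions, not necessarily the repo's exact keys):

```python
import yaml  # PyYAML

# Hypothetical config.yml contents mirroring the table above.
CONFIG_TEXT = """
embedding_size: 100
hidden_size: 256
batch_size: 32
dropout_rate: 0.5
learning_rate: 0.001
epochs: 100
weight_decay: 0.0005
max_length: 120
"""

config = yaml.safe_load(CONFIG_TEXT)
print(config["hidden_size"])  # → 256
```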

Model Deployment

The BiLSTM + CRF model has been deployed as a web app using Docker + Flask.

The code and demos are open-sourced in this repo.
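A minimal sketch of such a Flask wrapper (the endpoint name and the predict function are assumptions for illustration, not the deployed repo's actual code):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_entities(text):
    """Placeholder for the real model call; the deployed app would load the
    trained BiLSTM-CRF here and run Viterbi decoding on the input text."""
    return [{"text": text, "entities": []}]

@app.route("/ner", methods=["POST"])  # hypothetical endpoint
def ner():
    text = request.get_json(force=True).get("text", "")
    return jsonify(predict_entities(text))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Packaging this in a Docker image then only requires installing Flask, PyTorch, and the saved model files.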

Screenshot of the model output:


Reference

Traditional Methods for NER: BiLSTM + CNN + CRF

  1. Neural Architectures for Named Entity Recognition

  2. Log-Linear Models, MEMMs, and CRFs

  3. Named Entity Recognition with Bidirectional LSTM-CNNs

  4. A Survey on Deep Learning for Named Entity Recognition

SOTA Methods for NER (in my opinion)

  1. Lattice LSTM

  2. The Right Way to Do Chinese NER: A Summary of Lexicon-Enhancement Methods (from Lattice LSTM to FLAT) [中文NER的正确打开方式: 词汇增强方法总结 (从Lattice LSTM到FLAT)]

  3. How Industry Tackles the NER Problem [工业界如何解决NER问题]


License

MIT License

