superbrucejia / medicalner

An implementation of several models (BiLSTM-CRF, BiLSTM-CNN, BiLSTM-BiLSTM) for Medical Named Entity Recognition (NER)

License: MIT License

Medical Named Entity Recognition (MedicalNER)

Abstract

With the development of medical Artificial Intelligence (AI) systems, Natural Language Processing (NLP) has played an essential role in processing medical texts and building intelligent machines. Named Entity Recognition (NER), one of the most fundamental NLP tasks, is widely studied because it is the cornerstone of downstream NLP tasks such as Relation Extraction. In this work, character-level Bidirectional Long Short-Term Memory (BiLSTM)-based models are introduced to tackle the challenges of medical texts. The input character embedding vectors are randomly initialized and then updated during training. The character-level BiLSTM extracts features from the order-sensitive sequential medical data, and a Conditional Random Field (CRF) on top of it predicts the final entity tags. Results show that the presented method takes advantage of the recurrent architecture and achieves competitive performance on medical texts. These promising results pave the road towards building robust and powerful medical AI engines.


Topic and Study

Task: Named Entity Recognition (NER) implemented using PyTorch

Background: Medical & Clinical Healthcare

Level: Character (and Word) Level

Data Annotation: BIOES tagging Scheme
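As an illustration of the BIOES scheme (the entity label and text below are made up, not from the dataset), a small helper shows how one mention is tagged: single-character mentions get S-, longer ones B- … I- … E-, and characters outside any entity are tagged O:

```python
def bioes_tags(entity_chars, label):
    """Tag one entity mention under the BIOES scheme:
    S- for a single character, otherwise B- (begin), I- (inside), E- (end)."""
    if len(entity_chars) == 1:
        return [f"S-{label}"]
    return ([f"B-{label}"]
            + [f"I-{label}"] * (len(entity_chars) - 2)
            + [f"E-{label}"])

# "pain" as a hypothetical SYMPTOM mention.
print(bioes_tags(list("pain"), "SYMPTOM"))
# → ['B-SYMPTOM', 'I-SYMPTOM', 'I-SYMPTOM', 'E-SYMPTOM']
```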

Method:

  1. CRF++

  2. Character-level BiLSTM + CRF

  3. Character-level BiLSTM + Word-level BiLSTM + CRF

  4. Character-level BiLSTM + Word-level CNN + CRF
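Variants 3 and 4 combine two feature streams per token. A minimal sketch of variant 4 (character-level BiLSTM features concatenated with word-level CNN features); all sizes are illustrative assumptions, and the character and word sequences are assumed to be pre-aligned to the same length, which glosses over the segmentation step:

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Sketch of variant 4: char-level BiLSTM features concatenated with
    word-level CNN features (all sizes are illustrative assumptions)."""
    def __init__(self, char_vocab=2258, word_vocab=20000,
                 char_dim=30, word_dim=100, hidden=128):
        super().__init__()
        self.char_embed = nn.Embedding(char_vocab, char_dim)
        self.char_lstm = nn.LSTM(char_dim, hidden, bidirectional=True,
                                 batch_first=True)
        self.word_embed = nn.Embedding(word_vocab, word_dim)
        # 1-D convolution over the word sequence extracts n-gram features
        self.word_cnn = nn.Conv1d(word_dim, hidden, kernel_size=3, padding=1)

    def forward(self, char_ids, word_ids):
        char_feat, _ = self.char_lstm(self.char_embed(char_ids))   # (B, T, 2H)
        word_feat = self.word_cnn(
            self.word_embed(word_ids).transpose(1, 2)).transpose(1, 2)  # (B, T, H)
        return torch.cat([char_feat, word_feat], dim=-1)           # (B, T, 3H)
```

A CRF layer would then score tag sequences over the concatenated features.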

Results:

 Results of this work can be downloaded here.

Prerequisites:

 For Word-level models:

 The pre-trained word vectors can be downloaded here.

import codecs

import numpy as np

def load_word_vector(self):
    """Load pre-trained word vectors from a .vec text file."""
    print("Loading pre-trained word vectors ...")
    pre_trained = {}
    for line in codecs.open(self.model_path + "word_vectors.vec", 'r', encoding='utf-8'):
        line = line.rstrip().split()
        # Keep only well-formed lines: one token followed by word_dim floats
        if len(line) == self.word_dim + 1:
            pre_trained[line[0]] = np.array([float(x) for x in line[1:]], dtype=np.float32)
    return pre_trained
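Once loaded, the vectors are typically copied into an embedding matrix indexed by the word vocabulary, with small random fallbacks for uncovered words. The helper below is an illustrative pattern, not the repo's exact code:

```python
import numpy as np

def build_embedding_matrix(pre_trained, word_to_id, word_dim):
    """Fill an embedding matrix from pre-trained vectors; out-of-vocabulary
    words get small random vectors (a common fallback)."""
    scale = np.sqrt(3.0 / word_dim)
    matrix = np.random.uniform(-scale, scale,
                               (len(word_to_id), word_dim)).astype(np.float32)
    hits = 0
    for word, idx in word_to_id.items():
        if word in pre_trained:
            matrix[idx] = pre_trained[word]
            hits += 1
    print(f"{hits}/{len(word_to_id)} words covered by pre-trained vectors")
    return matrix
```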

 For Character-level models:

 Character embeddings are randomly initialized and updated during training via PyTorch's nn.Embedding module.

self.char_embed = nn.Embedding(num_embeddings=vocab_size, embedding_dim=self.char_dim)
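A quick check of what this lookup produces, using the character vocabulary size (2258) from the dataset statistics below and an illustrative embedding dimension of 30:

```python
import torch
import torch.nn as nn

char_embed = nn.Embedding(num_embeddings=2258, embedding_dim=30)
ids = torch.tensor([[5, 17, 42]])   # one sentence as three character indices
vectors = char_embed(ids)           # shape: (1, 3, 30)
assert vectors.requires_grad        # the lookup table is trained by backprop
```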

Some Statistics Info

Number of entities: 34

| No. | Entity | Number | Recognized |
|----:|--------|-------:|:----------:|
| 1 | E95f2a617 | 3221 | ✔︎ |
| 2 | E320ca3f6 | 6338 | ✔︎ |
| 3 | E340ca71c | 22209 | ✔︎ |
| 4 | E1ceb2bd7 | 3706 | ✔︎ |
| 5 | E1deb2d6a | 9744 | ✔︎ |
| 6 | E370cabd5 | 6196 | ✔︎ |
| 7 | E360caa42 | 5268 | ✔︎ |
| 8 | E310ca263 | 6948 | ✔︎ |
| 9 | E300ca0d0 | 9490 | ✔︎ |
| 10 | E18eb258b | 4526 | ✔︎ |
| 11 | E3c0cb3b4 | 6280 | ✔︎ |
| 12 | E1beb2a44 | 1663 | ✔︎ |
| 13 | E3d0cb547 | 1025 | ✔︎ |
| 14 | E14eb1f3f | 406 | |
| 15 | E8ff29ca5 | 1676 | ✔︎ |
| 16 | E330ca589 | 1487 | ✔︎ |
| 17 | E89f29333 | 1093 | |
| 18 | E8ef29b12 | 217 | |
| 19 | E1eeb2efd | 1637 | ✔︎ |
| 20 | E1aeb28b1 | 209 | |
| 21 | E17eb23f8 | 670 | ✔︎ |
| 22 | E87f05176 | 407 | ✔︎ |
| 23 | E88f05309 | 355 | ✔︎ |
| 24 | E19eb271e | 152 | |
| 25 | E8df2997f | 135 | |
| 26 | E94f2a484 | 584 | ✔︎ |
| 27 | E13eb1dac | 58 | |
| 28 | E85f04e50 | 6 | |
| 29 | E8bf057c2 | 7 | |
| 30 | E8cf297ec | 6 | |
| 31 | E8ff05e0e | 6 | ⨉︎ |
| 32 | E87e38583 | 18 | ⨉︎ |
| 33 | E86f04fe3 | 6 | ⨉︎ |
| 34 | E8cf05955 | 64 | ⨉︎ |

train data: 6494 sentences   vocab size: 2258   unique tags: 74

dev data: 865 sentences   vocab size: 2258   unique tags: 74

(data = number of sentences; vocab size = size of the character vocabulary; unique tags = number of prefix + entity combinations)


Structure of the code

At the root of the project, you will see:

├── data
|  └── train # Training set 
|  └── val # Validation set 
|  └── test # Testing set 
├── models
|  └── data.pkl # Containing all the used data, e.g., look-up table
|  └── params.pkl # Saved PyTorch model
├── preprocess-data.py # Preprocess the original dataset
├── data_manager.py # Load the train/val/test data
├── model.py # BiLSTM-CRF with Attention Model
├── main.py # Main codes for the training and prediction
├── utils.py # Some functions for prediction stage and evaluation criteria
├── config.yml # Contain the hyper-parameters settings

Basic Model Architecture

    Character Input
          |                         
     Lookup Layer  <----------------|    Update Character Embedding
          |                         |
     Bi-LSTM Model  <---------------|        Extract Features
          |                         |     Back-propagation Errors
     Linear Layer  <----------------|   Update Trainable Parameters
          |                         |
       CRF Model  <-----------------|    
          |                         |
Output corresponding tags  ---> [NLL Loss] <---  Target tags
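The pipeline in the diagram can be sketched in PyTorch as below; sizes are assumptions, and the CRF layer, which adds tag-transition scores and Viterbi decoding on top of the linear layer's emissions, is reduced to a comment:

```python
import torch
import torch.nn as nn

class BiLSTMEmitter(nn.Module):
    """Character input -> lookup layer -> BiLSTM -> linear emission scores.
    A CRF layer (not shown) would score tag transitions over these emissions
    and pick the best tag path with Viterbi decoding."""
    def __init__(self, vocab_size=2258, char_dim=30, hidden=128, num_tags=74):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, char_dim)   # lookup layer
        self.lstm = nn.LSTM(char_dim, hidden, bidirectional=True,
                            batch_first=True)             # feature extractor
        self.linear = nn.Linear(2 * hidden, num_tags)     # emission scores

    def forward(self, char_ids):
        feats, _ = self.lstm(self.embed(char_ids))
        return self.linear(feats)                         # (B, T, num_tags)
```

Training minimizes the CRF negative log-likelihood (the NLL loss in the diagram), which back-propagates through all trainable parameters including the embedding table.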

Limitations

  1. Currently only CPU training is supported

    GPU training is much slower than CPU training because Viterbi decoding runs as a Python for-loop over time steps, which cannot be parallelized on the GPU.

  2. Entities with too few examples (< 500 samples) cannot be recognized reliably
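The decoding bottleneck comes from Viterbi being inherently sequential over time steps. A pure-Python sketch (emission and transition shapes are assumptions) shows the loop in question:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (T, K) per-step tag scores; transitions: (K, K) scores for
    moving from tag i to tag j. Returns the best-scoring tag path.
    The `for t in range(...)` below is what serializes decoding and makes
    GPU execution slower than CPU for this step."""
    T, K = emissions.shape
    score = emissions[0].copy()
    backpointers = []
    for t in range(1, T):                  # sequential: cannot batch over time
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for bp in reversed(backpointers):      # trace back the best path
        best.append(int(bp[best[-1]]))
    return best[::-1]
```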


Final Results

Overall F1 score on 18 entities:

Separate F1 score on each entity:

| No. | Entity | Number | F1 Score |
|----:|--------|-------:|:--------:|
| 1 | E95f2a617 | 3221 | ✔︎ |
| 2 | E320ca3f6 | 6338 | ✔︎ |
| 3 | E340ca71c | 22209 | ✔︎ |
| 4 | E1ceb2bd7 | 3706 | ✔︎ |
| 5 | E1deb2d6a | 9744 | ✔︎ |
| 6 | E370cabd5 | 6196 | ✔︎ |
| 7 | E360caa42 | 5268 | ✔︎ |
| 8 | E310ca263 | 6948 | ✔︎ |
| 9 | E300ca0d0 | 9490 | ✔︎ |
| 10 | E18eb258b | 4526 | ✔︎ |
| 11 | E3c0cb3b4 | 6280 | ✔︎ |
| 12 | E1beb2a44 | 1663 | ✔︎ |
| 13 | E3d0cb547 | 1025 | ✔︎ |
| 14 | E8ff29ca5 | 1676 | ✔︎ |
| 15 | E330ca589 | 1487 | ✔︎ |
| 16 | E1eeb2efd | 1637 | ✔︎ |
| 17 | E17eb23f8 | 670 | ✔︎ |
| 18 | E94f2a484 | 584 | ✔︎ |

Hyperparameter settings

| Name | Value |
|------|-------|
| embedding_size | 30 / 40 / 50 / 100 |
| hidden_size | 128 / 256 |
| batch_size | 8 / 16 / 32 / 64 |
| dropout rate | 0.50 / 0.75 |
| learning rate | 0.01 / 0.001 |
| epochs | 100 |
| weight decay | 0.0005 |
| max length | 100 / 120 |
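These settings live in config.yml. A hedged sketch of how such a file might be parsed (the key names below mirror the table but are assumptions, not necessarily the repo's exact keys):

```python
import yaml  # PyYAML

# Hypothetical config.yml contents mirroring the table above.
CONFIG_TEXT = """
embedding_size: 100
hidden_size: 256
batch_size: 32
dropout_rate: 0.5
learning_rate: 0.001
epochs: 100
weight_decay: 0.0005
max_length: 120
"""

config = yaml.safe_load(CONFIG_TEXT)
print(config["hidden_size"])  # → 256
```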

Model Deployment

The BiLSTM + CRF model has been deployed as a web app using Docker + Flask.

The code and demos are open-sourced in this repo.
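A minimal sketch of such a Flask wrapper (the endpoint name and the predict function are assumptions for illustration, not the deployed repo's actual code):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_entities(text):
    """Placeholder for the real model call; the deployed app would load the
    trained BiLSTM-CRF here and run Viterbi decoding on the input text."""
    return [{"text": text, "entities": []}]

@app.route("/ner", methods=["POST"])  # hypothetical endpoint
def ner():
    text = request.get_json(force=True).get("text", "")
    return jsonify(predict_entities(text))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Packaging this in a Docker image then only requires installing Flask, PyTorch, and the saved model files.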

Screenshot of the model output:


Reference

Traditional Methods for NER: BiLSTM + CNN + CRF

  1. Neural Architectures for Named Entity Recognition

  2. Log-Linear Models, MEMMs, and CRFs

  3. Named Entity Recognition with Bidirectional LSTM-CNNs

  4. A Survey on Deep Learning for Named Entity Recognition

SOTA Methods for NER (in my opinion)

  1. Lattice LSTM

  2. The Right Way to Do Chinese NER: A Summary of Lexicon-Enhancement Methods (from Lattice LSTM to FLAT) [中文NER的正确打开方式: 词汇增强方法总结 (从Lattice LSTM到FLAT)]

  3. How Industry Tackles the NER Problem [工业界如何解决NER问题]


License

MIT License

