Giter Club home page Giter Club logo

aioner's Introduction

AIONER


Biomedical named entity recognition (BioNER) seeks to automatically recognize biomedical entities in natural language text, serving as a necessary foundation for downstream text mining tasks and applications. In this work, we propose AIONER, a new BioNER tagger that takes full advantage of various existing datasets for recognizing multiple entities simultaneously, through a novel all-in-one (AIO) scheme. This repo contains the source code and dataset for the AIONER.

Content

Dependency package

The codes have been tested using Python 3.7 on CentOS and uses the following main dependencies on a CPU and GPU:

To install all dependencies automatically using the command:

$ pip install -r requirements.txt

Introduction of folders

  • data
    • pubtator: the datasets in pubtator format
    • conll: the datasets in CoNLL format
  • example: input example files, BioC and PubTator formats (abstract or full text)
  • src: the codes for AIONER
    • AIONER_Run.py: the script for running AIONER
    • AIONER_Training.py: the script for training AIONER model
    • AIONER_FineTune.py: the script for fine-tuning AIONER model for the new NER task
    • Format_Preprocess.py: the preprocesing script to covert the pubtator format to conll format
  • vocab: label files for NER

Pre-trained model preparation

To run this code, you need to first download the model file ( it includes the files for two original pre-trained models and two AIONER models), then unzip and put the model folder into the AIONER folder.

  • bioformer-cased-v1.0: the original bioformer model files
  • BiomedNLP-PubMedBERT-base-uncased-abstract: the original PubMedBERT model files
  • AIONER:
    • PubmedBERT-CRF-AIONER.h5: the PubMedBERT-CRF AIONER model
    • Bioformer-softmax-AIONER.h5: the Bioformer-softmax AIONER model

Running AIONER

Use our trained models (i.e., PubmedBERT/Bioformer) for running AIONER by AIONER_Run.py.

The file has 5 parameters:

  • --inpath, -i, help="input folder"
  • --model, -m, help="trained AIONER model file"
  • --vocabfile, -v, help="vocab file with BIO label"
  • --entity, -e, help="predict entity type (Gene, Chemical, Disease, Mutation, Species, CellLine, ALL)"
  • --outpath, -o, help="output folder to save the AIONER tagged results"

The input file format is BioC(xml) or PubTator format. You can find some input examples in the /example/input/ folder .

Run Example:

$ python AIONER_Run.py -i ../example/input/ -m ../pretrained_models/AIONER/Bioformer-softmax-AIONER.h5 -v ../vocab/AIO_label.vocab -e ALL -o ../example/output/

Training a new AIONER model

You can train a new AIONER model using the AIONER_Training.py file.

The file has 5 parameters:

  • --trainfile, -t, help="the training set file in CoNLL format"
  • --valfile, -v, help="the validation set file in CoNLL format (optional)"
  • --encoder, -e, help="the encoder of model (bioformer or pubmedbert)?"
  • --decoder, -d, help="the decoder of model (crf or softmax)?"
  • --outpath, -o, help="the model output folder"

Note that --valfile is an optional parameter. When the validation set is provided, the model training will early stop by the performance on the validation. If no, the model training will early stop by the accuracy of training set.

Run Example:

$ python NER_Training.py -t ../data/AIONER/AIONER_Train-ALL.conll  -e bioformer -d softmax -o ../models/

After the training is finished, the trained model (e.g., bioformer-softmax-es-AIO.h5) will be generated in the output folder. If the development set is provided, two trained models (bioformer-softmax-es-AIO.h5 for early stopping by the accuracy of training set; bioformer-softmax-best-AIO.h5 for early stopping by the performance on the validation set) will be generated in the output folder.

Fine-tune for a new NER task

Use our pretrained AIONER models for fine-tuning a new NER task.

First, you need to fine-tune the model using the new training set by AIONER_FineTune.py file.

The file has 5 parameters:

  • --trainfile, -t, help="the training set file"
  • --devfile, -d, help="the development set file"
  • --vocabfile, -v, help="vocab file with BIO label"
  • --modeltype, -m, help="deep learning model (bioformer or pubmedbert?)"
  • --outpath, -o, help="the fine-tuned model output folder"

Note that, the input file is conll format with adding tags. You can covert the pubtator format to conll format using the Format_Preprocess.py file. Moreover, --devfile is an optional parameter. When the development set is provided, the model training will early stop by the performance on the development. If no, the model training will early stop by the accuracy of training set.

Run Example:

$ python AIONER_FineTune.py -t ../data/conll/AnEM_train.conll -v ../vocab/AnEM_label.vocab -m bioformer -o ../models/

After the training is finished, the trained model (e.g., bioformer-softmax-es-finetune.h5) will be generated in the output folder. If the development set is provided, two trained models (bioformer-softmax-es-finetune.h5 for early stopping by the accuracy of training set; bioformer-softmax-best-finetune.h5 for early stopping by the performance on the validation set) will be generated in the output folder.

Then you can use the fine-tune model for tagging by AIONER_Run.py file.

Run Example:

$ python AIONER_Run.py -i ../example/input/ -m bioformer-softmax-es-finetune.h5 -v ../vocab/AnEM_label.vocab -e ALL -o example/output/

Format conversion

You can covert the pubtator format to conll format using the Format_Preprocess.py file.

The file has 3 parameters:

  • --inpath, -i, help="the input folder of training set files in Pubtator format"
  • --mapfile, -m, help="the mapfile to coversion"
  • --outpath, -o, help="the output folder of the files in CoNLL format"

Note that --mapfile is a file to guide the entity type and label. For example, in the file of "list_file.txt":

"NCBIdisease.PubTator Disease Disease:DiseaseClass|SpecificDisease|CompositeMention|Modifier" denotes the input pubtator file is "NCBIdisease.PubTator", the AIONER-label is "Disease", and the entities with entity types of "DiseaseClass, SpecificDisease, CompositeMention, Modifier" are used and changed to a new entity type is "Disease".

Run Example:

$ python Format_Preprocess.py -i ../data/pubtator/ -m ../data/list_train.txt -o ../data/conll/

After the conversion, the all conll files used for training need to merge into a file.

Acknowledgments

This research was supported by the Intramural Research Program of the National Library of Medicine (NLM), National Institutes of Health.

Disclaimer

This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.


aioner's People

Contributors

lingluodlut avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.