biomedical_ner

Named entity extraction (ADE, Reason, and Drug mentions) from clinical notes. Dataset= N2C2 2018, Track 2

Pre-requisites

  • Cleanup
    • Incorrect annotations, e.g. 104095.ann= T17 Drug 4305 4314;4314 4315 Olanzapin e (the entity does not continue on the next line in the txt file)
    • Random characters in ann files like "" (102027, line 410)
  • Data
    • track2-training_data_2
    • track2-training_data_3 (put files directly inside this folder)
    • gold-standard-test-data (put files directly inside this folder)
  • Libraries
    • preprocessing packages= nltk, pandas, nltk.download('punkt')
    • training packages(included in notebook)
    • testing packages(included in notebook)
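
The cleanup items above can be checked mechanically before preprocessing. The following is a minimal sketch, not part of the repository: it assumes the .ann files follow the brat standoff layout (tab-separated ID, type plus offsets, and text), which the example line suggests, and it assumes the "random characters" are non-ASCII. The names `Entity`, `parse_entity_line`, and `is_suspicious` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]  # one (start, end) pair per fragment
    text: str


def parse_entity_line(line: str) -> Entity:
    """Parse one text-bound ("T") annotation line (assumed tab-separated)."""
    ann_id, type_and_offsets, text = line.rstrip("\n").split("\t", 2)
    label, offsets = type_and_offsets.split(" ", 1)
    spans = []
    # Discontinuous annotations list several "start end" pairs joined by ";".
    for fragment in offsets.split(";"):
        start, end = fragment.split()
        spans.append((int(start), int(end)))
    return Entity(ann_id, label, spans, text)


def is_suspicious(entity: Entity) -> bool:
    """Flag fragmented spans and non-ASCII text for manual review."""
    return len(entity.spans) > 1 or not entity.text.isascii()


example = parse_entity_line("T17\tDrug 4305 4314;4314 4315\tOlanzapin e")
print(example.spans, is_suspicious(example))
```

A flagged line is only a candidate for review; whether it is actually wrong still has to be checked against the matching txt file.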

Project structure

  • code
    • preprocess= script that processes the raw data and prepares the training and testing data. Involves nltk tokenization, overlapping-tag filtering, sentence ID construction, and entity statistics.
    • train= training notebook for BERT-LR model (BERT baseline)
    • test= testing notebook for BERT-LR model
    • train-crf= training notebook for BERT-CRF model (ours)
    • test-crf= testing notebook for BERT-CRF model
  • data
    • track2-training_data_2= raw training data
    • gold-standard-test-data= raw test data
    • track2-training_data_3= raw validation data
    • train= processed training data
    • val= processed validation data
    • test= processed testing data
  • output
    • v4= results for BERT-LR (our older model, the BERT baseline)
    • v5= results for BERT-CRF (ours)
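
The preprocess step listed above turns character-offset annotations into token-level rows with a sentence ID. The sketch below illustrates the idea under several assumptions that the source does not confirm: a BIO tagging scheme, a regex tokenizer and a punctuation-based sentence boundary as stand-ins for nltk, and the hypothetical name `tag_tokens`.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def tag_tokens(text: str, entities: List[Span]) -> List[Tuple[int, str, str]]:
    """Return (sentence_id, token, tag) rows for one note."""
    rows = []
    sentence_id = 0
    # Words and single punctuation marks, with their character offsets.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        start, end = match.span()
        tag = "O"
        for ent_start, ent_end, label in entities:
            if start >= ent_start and end <= ent_end:
                # First token of an entity gets B-, later tokens get I-.
                tag = ("B-" if start == ent_start else "I-") + label
                break
        rows.append((sentence_id, match.group(), tag))
        if match.group() in {".", "?", "!"}:
            sentence_id += 1  # naive boundary, stand-in for nltk
    return rows


note = "Started olanzapine 5 mg. Developed a skin rash."
drug_start = note.index("olanzapine")
ade_start = note.index("skin rash")
entities = [
    (drug_start, drug_start + len("olanzapine"), "Drug"),
    (ade_start, ade_start + len("skin rash"), "ADE"),
]
rows = tag_tokens(note, entities)
for row in rows:
    print(row)
```

Each row carries the sentence ID so that tokens can later be grouped back into sentences for training.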

Steps

  • run preprocess.py to generate the data folders
  • run the train-crf.ipynb notebook to train the BERT-CRF model
  • run the test-crf.ipynb notebook to test the model and generate results

Note- TAG_OVERLAP_FILES(more than one entity in a span) = {"101136", "100187", "129286", "106621", "108889", "110384", "122093", "110342", "111458", "113705", "178143", "113840", "110863", "102557", "110037", "119144", "111882", "100883", "114965", "117745", "109191", "113524", "113824", "116966", "105050", "123475", "110499", "192798", "100590", "101427", "103722", "100677", "103315", "112628", "115021", "107322", "134445", "100229", "114452", "105778", "103142", "115267", "105360", "111298", "109527", "116901", "185800", "114923", "114680", "110521", "103293", "107255", "103926", "141586", "105106", "109176", "119573", "115667", "157609", "107202", "108539", "107515", "115138", "106015", "109397", "121861", "181643", "104446", "110559", "126085", "112226", "121631", "106334", "123807", "106915", "116204", "105585", "189637", "115169", "119906", "111160", "103761", "103430", "106993", "107128", "105537", "146944", "113222", "106945", "115433", "105579", "109698", "104929", "142444", "112329", "109724"}
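
The note above lists files in which more than one entity covers the same span. One way such files could be detected is by checking whether any two character ranges overlap. This is an illustrative sketch only, not necessarily how preprocess.py builds the list; the name `has_tag_overlap` is hypothetical, and spans are assumed to use exclusive end offsets.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def has_tag_overlap(spans: List[Span]) -> bool:
    """Return True if any two entity spans share at least one character."""
    ordered = sorted(spans, key=lambda s: (s[0], s[1]))
    furthest_end = -1
    for start, end, _label in ordered:
        # A span starting before the furthest end seen so far overlaps.
        if start < furthest_end:
            return True
        furthest_end = max(furthest_end, end)
    return False


print(has_tag_overlap([(0, 5, "Drug"), (10, 14, "ADE")]))   # False
print(has_tag_overlap([(0, 5, "Drug"), (3, 8, "Reason")]))  # True
```

Spans that merely touch, such as (0, 5) and (5, 8), are not counted as overlapping under this definition.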

Contributors

  • soujanyarbhat
  • varunchaudharycs
