biowordvec's Introduction

BioWordVec: Improving Biomedical Word Embeddings with Subword Information and MeSH

This source code is a demo implementation of the method described in the paper "BioWordVec: Improving Biomedical Word Embeddings with Subword Information and MeSH". This is research software, provided as is, without express or implied warranties; see licence.txt for details. We have tried to make it reasonably usable and have provided help options, but adapting the system to new environments or converting a corpus to the format the system expects may require significant effort.

Data files

MeSH_graph.edgelist is the MeSH main-heading graph file. MeSH_dic.pkl.gz is used to align the MeSH heading IDs with mention words. The PubMed corpus and MeSH RDF data can be downloaded from NCBI.
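
As a rough illustration of how the two data files might be read, here is a minimal sketch using only the Python standard library. It assumes that MeSH_graph.edgelist holds one whitespace-separated pair of node IDs per line, that the graph can be treated as undirected, and that MeSH_dic.pkl.gz is a gzip-compressed pickle; none of this is confirmed above, and the helper names load_edgelist and load_gzipped_pickle are hypothetical.

```python
import gzip
import pickle
from collections import defaultdict


def load_edgelist(path):
    """Read a whitespace-separated edge list into an adjacency map.

    Assumption: one "source target" pair per line, treated as undirected.
    """
    adjacency = defaultdict(set)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) < 2 or parts[0].startswith("#"):
                continue  # skip blank lines and comment lines
            source, target = parts[0], parts[1]
            adjacency[source].add(target)
            adjacency[target].add(source)
    return dict(adjacency)


def load_gzipped_pickle(path):
    """Load a gzip-compressed pickle. Only unpickle files you trust."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)
```

For example, load_edgelist("MeSH_graph.edgelist") would return a dictionary that maps each heading ID to the set of its neighbours, if the file matches the assumed layout.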

Prerequisites

  • python 3.5
  • networkx 1.11
  • gensim 2.3

Usage

Users can run BioWordVec.py to learn biomedical word embeddings automatically from the PubMed text corpus and MeSH data.

Pre-trained word embedding

We created two specialized, task-dependent sets of word embeddings, "Bio-embedding-intrinsic" and "Bio-embedding-extrinsic", by setting the context window size to 20 and 5, respectively. The pre-trained BioWordVec data are freely available on Figshare. "Bio-embedding-intrinsic" is intended for intrinsic tasks, that is, calculating or predicting semantic similarity between words, terms or sentences. "Bio-embedding-extrinsic" is intended for extrinsic tasks, where the embeddings serve as input to downstream NLP tasks such as relation extraction or text classification. Both sets are in binary format and contain 2,324,849 distinct words in total. All words were converted to lowercase, and the vectors have 200 dimensions.
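
The following sketch shows one way to read vectors from such a binary file without extra dependencies. It assumes the files follow the word2vec binary layout (a text header "vocab_size dim", then each word followed by a space and dim little-endian 32-bit floats); the text above only says "binary format", so treat this as an assumption. The function name read_word2vec_binary is hypothetical. With gensim installed, KeyedVectors.load_word2vec_format(path, binary=True) is the usual way to load files in this layout.

```python
import struct


def read_word2vec_binary(path, limit=None):
    """Yield (word, vector) pairs from a word2vec-style binary file.

    Assumption: header line "vocab_size dim", then for each entry the
    word, a single space, and dim little-endian float32 values.
    """
    with open(path, "rb") as handle:
        vocab_size, dim = (int(x) for x in handle.readline().split())
        count = vocab_size if limit is None else min(limit, vocab_size)
        record = struct.Struct("<%df" % dim)
        for _ in range(count):
            chars = bytearray()
            while True:
                byte = handle.read(1)
                if not byte:
                    raise EOFError("file ended inside a word")
                if byte == b" ":
                    break
                if byte != b"\n":  # entries may be separated by a newline
                    chars.extend(byte)
            data = handle.read(record.size)
            if len(data) != record.size:
                raise EOFError("file ended inside a vector")
            yield chars.decode("utf-8", errors="replace"), record.unpack(data)
```

Passing limit=1000 reads only the first 1000 entries, which is handy for a quick check that a download is intact before loading all 2,324,849 words.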

We used the UMNSRS datasets to evaluate the pre-trained word embeddings on medical word pair similarity.

Word embeddings      UMNSRS-Sim (Pearson score)  UMNSRS-Sim (Spearman score)  UMNSRS-Rel (Pearson score)  UMNSRS-Rel (Spearman score)
Pyysalo et al.       0.662                       0.652                        0.600                       0.601
Chiu et al.          0.665                       0.654                        0.608                       0.607
BioWordVec (win20)   0.667                       0.657                        0.619                       0.617
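
Pearson and Spearman scores of this kind measure how well model similarity scores track human ratings for the same word pairs. The sketch below computes both correlations in plain Python. It assumes the model score for a pair is the cosine similarity of the two word vectors, which is a common choice but is not stated above; the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Given a list of model scores and a list of human ratings in the same pair order, pearson(model, human) and spearman(model, human) give the two numbers reported per dataset.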

We also used the BioCreative/OHNLP STS dataset to evaluate the pre-trained word embeddings on clinical sentence pair similarity.

Similarity measures  Pyysalo et al.  Chiu et al.  BioWordVec (win20)
Cosine               0.755           0.757        0.771
Euclidean            0.723           0.727        0.753
Block                0.722           0.727        0.752
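
To make the three measures concrete, here is a minimal sketch of how they could be computed for a sentence pair. Two things are assumptions rather than statements from this page: a sentence vector is taken as the average of its word vectors, and "Block" is read as the city-block (Manhattan) distance. The function names and the toy vectors in the usage note are hypothetical.

```python
import math


def sentence_vector(tokens, vectors):
    """Average the vectors of the tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has an embedding")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (higher means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def euclidean_distance(a, b):
    """Straight-line distance (lower means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def block_distance(a, b):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

With toy vectors such as {"chest": [1.0, 0.0], "pain": [0.0, 1.0]}, sentence_vector(["chest", "pain"], vectors) averages the two entries, and the result can be passed to any of the three measures. Note that cosine is a similarity while the other two are distances, so they have to be converted or sign-flipped before being compared with gold similarity scores.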

More usage notes can be found in our paper.

References

If you use any of our pre-trained models in your application, please cite the following paper:

Zhang Y, Chen Q, Yang Z, Lin H, Lu Z. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data. 2019.

List of Contributors

Yijia Zhang, Qingyu Chen, Zhihao Yang, Hongfei Lin and Zhiyong Lu

Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine. We are grateful to the authors of fastText, Node2vec and UMNSRS for making their software and data publicly available.

biowordvec's People

Contributors

qingyu-qc, zhangyijia1979

biowordvec's Issues

Regarding the bio_embedding_intrinsic download file

I see the following error while loading the model
UnpicklingError: unpickling stack underflow

Looks like this can be because of the old scipy format of the saved file. Is there a way to get the txt file format?

Thanks
Abhishek

Unexpected keyword argument 'size'

Using:

  • Python 3.9.6
  • Gensim 4.1.2
  • Networkx 2.6.3

I ran into an issue while executing the BioWordVec.py file with only the default arguments.

Walk iteration:
1 / 2
2 / 2
Traceback (most recent call 
  File "F:\Projects\BioWordV
    main(args)
  File "F:\Projects\BioWordV
    model = FastText(MySentes, min_count=args.min_count,
TypeError: __init__() got an unexpected keyword argument 'size'
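
This error is consistent with the API change in gensim 4.0, which renamed the `size` keyword to `vector_size` and `iter` to `epochs`, while the prerequisites above list gensim 2.3. One option is to install the listed gensim version; another is to rename the keywords at the call site. The helper below is a hypothetical sketch of the second option, not part of BioWordVec.py, and the argument values in the usage note are illustrative only.

```python
# Keyword arguments renamed in gensim 4.0 (old name -> new name).
GENSIM4_RENAMES = {"size": "vector_size", "iter": "epochs"}


def to_gensim4_kwargs(kwargs):
    """Return a copy of kwargs with gensim 3 names mapped to gensim 4 names."""
    converted = {}
    for name, value in kwargs.items():
        new_name = GENSIM4_RENAMES.get(name, name)
        if new_name in converted:
            # Guard against passing e.g. both size= and vector_size=.
            raise ValueError("both old and new names given for %r" % new_name)
        converted[new_name] = value
    return converted
```

A call written for the old API, such as FastText(sentences, size=200, window=20, iter=5), could then become FastText(sentences, **to_gensim4_kwargs({"size": 200, "window": 20, "iter": 5})) under gensim 4.x. Other parts of the script may also need updating for the newer gensim and networkx versions.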
