
BioWordVec & BioSentVec:
pre-trained embeddings for biomedical words and sentences

Text corpora

We created biomedical word and sentence embeddings using PubMed and clinical notes from the MIMIC-III Clinical Database. Both the PubMed and MIMIC-III texts were split into sentences and tokenized using NLTK, and all words were lowercased. The statistics of the two corpora are shown below.

Sources                   Documents   Sentences    Tokens
PubMed                    28,714,373  181,634,210  4,354,171,148
MIMIC-III clinical notes  2,083,180   41,674,775   539,006,967
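
The preprocessing described above could be reproduced roughly as follows. This is a minimal sketch, assuming NLTK's default sentence splitter and word tokenizer; the exact tokenizer settings used by the authors are not stated here.

from nltk import sent_tokenize, word_tokenize  # requires nltk.download('punkt')

def preprocess(document):
    # Split into sentences, tokenize, and lowercase: one sentence per line.
    for sentence in sent_tokenize(document):
        yield " ".join(word_tokenize(sentence)).lower()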

We applied fastText to compute 200-dimensional word embeddings. We set the window size to 20, the learning rate to 0.05, the sampling threshold to 1e-4, and the number of negative examples to 10. Both the word vectors and the model with hyperparameters are available for download below. The model file can be used to compute vectors for words that are not in the dictionary (i.e., out-of-vocabulary terms). This work extends the original BioWordVec, which provides fastText word embeddings trained on PubMed and MeSH. We used the same parameters as the original BioWordVec, which has been thoroughly evaluated in a range of applications.
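
For illustration, comparable word vectors could be trained with the fasttext Python bindings as sketched below. The corpus path is hypothetical, and the skip-gram variant is an assumption (the text above does not name the fastText model type); only the stated hyperparameters are taken from the text.

import fasttext

model = fasttext.train_unsupervised(
    "pubmed_mimic_sentences.txt",  # hypothetical: one tokenized, lowercased sentence per line
    model="skipgram",              # assumption; the variant is not stated above
    dim=200,                       # 200-dimensional vectors
    ws=20,                         # window size
    lr=0.05,                       # learning rate
    t=1e-4,                        # sampling threshold
    neg=10,                        # negative examples
)
model.save_model("biowordvec_sketch.bin")  # the .bin model can embed out-of-vocabulary terms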

We evaluated BioWordVec for medical word pair similarity. We used the MayoSRS (101 medical term pairs; download here) and UMNSRS_similarity (566 UMLS concept pairs; download here) datasets.

Model             MayoSRS  UMNSRS_similarity
word2vec          0.513    0.626
BioWordVec model  0.552    0.660
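
A word-pair evaluation along these lines computes cosine similarity between the term vectors and correlates it with the human ratings. This is a hedged sketch: the pairs list is a placeholder, and whether the table reports Spearman or Pearson correlation is not stated above.

import fasttext
from scipy.spatial import distance
from scipy.stats import spearmanr

model = fasttext.load_model("BioWordVec_PubMed_MIMICIII_d200.bin")

def evaluate(pairs):
    # pairs: list of (term1, term2, human_score) tuples from MayoSRS or UMNSRS.
    predicted = [1 - distance.cosine(model.get_word_vector(a), model.get_word_vector(b))
                 for a, b, _ in pairs]
    gold = [score for _, _, score in pairs]
    return spearmanr(predicted, gold)[0]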

We applied sent2vec to compute 700-dimensional sentence embeddings. We used the bigram model and set the window size to 20 and the number of negative examples to 10.

We evaluated BioSentVec for clinical sentence pair similarity tasks. We used the BIOSSES (100 sentence pairs; download here) and the MedSTS (1068 sentence pairs; download here) datasets.

Model                                               BIOSSES  MedSTS
Unsupervised methods
    doc2vec                                         0.787    -
    Levenshtein Distance                            -        0.680
    Averaged word embeddings                        0.694    0.747
    Universal Sentence Encoder                      0.345    0.714
    BioSentVec (PubMed)                             0.817    0.750
    BioSentVec (MIMIC-III)                          0.350    0.759
    BioSentVec (PubMed + MIMIC-III)                 0.795    0.767
Supervised methods
    Linear Regression                               0.836    -
    Random Forest                                   -        0.818
    Deep learning + Averaged word embeddings        0.703    0.784
    Deep learning + Universal Sentence Encoder      0.401    0.774
    Deep learning + BioSentVec (PubMed)             0.824    0.819
    Deep learning + BioSentVec (MIMIC-III)          0.353    0.805
    Deep learning + BioSentVec (PubMed + MIMIC-III) 0.848    0.836
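
The unsupervised BioSentVec rows above correspond to cosine similarity between sentence embeddings, correlated against the gold scores. A minimal sketch, assuming the released model file name and Pearson correlation (the correlation measure is not stated in this table):

import sent2vec
from scipy.spatial import distance
from scipy.stats import pearsonr

model = sent2vec.Sent2vecModel()
model.load_model("BioSentVec_PubMed_MIMICIII-bigram_d700.bin")

def evaluate(pairs):
    # pairs: list of (sentence1, sentence2, gold_score) tuples from BIOSSES or MedSTS.
    predicted = [1 - distance.cosine(model.embed_sentence(s1)[0],
                                     model.embed_sentence(s2)[0])
                 for s1, s2, _ in pairs]
    gold = [g for _, _, g in pairs]
    return pearsonr(predicted, gold)[0]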

FAQ

You can find answers to frequently asked questions on our Wiki; e.g., you can find the instructions on how to load these models.

There is also a tutorial on how to use BioSentVec for a quick start.
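
For reference, a minimal loading sketch (file names are those of the released models; input sentences are assumed to be tokenized and lowercased as described above):

import fasttext
import sent2vec

# Sentence embeddings (700-dim).
sent_model = sent2vec.Sent2vecModel()
sent_model.load_model("BioSentVec_PubMed_MIMICIII-bigram_d700.bin")
vec = sent_model.embed_sentence("breast cancer is the most common cancer in women .")

# Word embeddings (200-dim), including out-of-vocabulary terms via subwords.
word_model = fasttext.load_model("BioWordVec_PubMed_MIMICIII_d200.bin")
wv = word_model.get_word_vector("kidney")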

References

When using some of our pre-trained models for your application, please cite the following papers:

  1. Zhang Y, Chen Q, Yang Z, Lin H, Lu Z. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data. 2019.
  2. Chen Q, Peng Y, Lu Z. BioSentVec: creating sentence embeddings for biomedical texts. The 7th IEEE International Conference on Healthcare Informatics. 2019.

Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine. We are grateful to the authors of fastText, sent2vec, MayoSRS, UMNSRS, BIOSSES, and MedSTS for making their software and data publicly available.


biosentvec's Issues

gensim - read bin file?

Hi everyone,

thank you for making this resource public; it is a great help to the community.

I am having issues loading the .bin (model) file with gensim. Code snippet follows:

from gensim.models.fasttext import load_facebook_vectors

gensim_fasttext_model = load_facebook_vectors(
    root_path + "models/pretrained/BioWordVec_PubMed_MIMICIII_d200.bin")

Although the above snippet consumes CPU and RAM, it never finishes loading the model (it appears to load indefinitely), nor does it produce an error.

When I load it with the fastText library instead, it loads within about 90 seconds. Code snippet:

import fastText as fasttext
fasttext.load_model(root_path+'models/pretrained/BioWordVec_PubMed_MIMICIII_d200.bin')

Unfortunately, I would prefer to use the gensim approach, as it supports item access to generate representations (e.g. model['word_to_represent']).

I can load the vec.bin file (the pretrained word-embedding mapping) with

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('BioWordVec_PubMed_MIMICIII_d200.vec.bin', binary=True)

but that does not help in my current pipeline (as it does not handle OOVs).

Could you provide any guidance on what might be wrong? My environment is:

Linux-3.10.0-693.5.2.el7.x86_64-x86_64-with-centos-7.6.1810-Core
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]
NumPy 1.16.2
SciPy 1.1.0
gensim 3.7.2
FAST_VERSION 1

BioWordVec - how to handle phrases

The MayoSRS and UMNSRS_similarity datasets mostly contain phrases. Did you use mean pooling to get the results you reported, or some other pooling mechanism for n-grams longer than 1?

Thanks
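
(For illustration, mean pooling over the fastText word vectors would look like the sketch below; whether the authors pooled this way is exactly the open question above, so this is only an assumption.)

import numpy as np
import fasttext

model = fasttext.load_model("BioWordVec_PubMed_MIMICIII_d200.bin")

def phrase_vector(phrase):
    # Average the word vectors of the lowercased tokens in the phrase.
    vectors = [model.get_word_vector(tok) for tok in phrase.lower().split()]
    return np.mean(vectors, axis=0)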

name 'stdvector_base' is not defined when calling sent2vec.Sent2vecModel.embed_sentence()

NameError                                 Traceback (most recent call last)
<ipython-input> in <module>
      3 for line in pos:
      4     print(line)
----> 5     sentence_vector = model.embed_sentence(line)
      6     pos_arrays[i] = sentence_vector
      7     pos_labels[i] = 1

src/sent2vec.pyx in sent2vec.Sent2vecModel.embed_sentence()

src/sent2vec.pyx in sent2vec.Sent2vecModel.embed_sentences()

src/sent2vec.pyx in sent2vec.vector_wrapper.asarray()

NameError: name 'stdvector_base' is not defined

I looked at the code of stdvector_base and only see "pass" in the class definition. I added a default constructor, but that did not resolve the issue. Any suggestions? This April I could run it successfully, but it no longer works.

terminate called after throwing an instance of 'std::bad_alloc' in AWS EFS while loading BioSentVec

I built a TensorFlow model on top of BioSentVec embeddings. Now that I am trying to deploy the model, I need BioSentVec at inference time to preprocess the inputs.

I am trying to deploy the model on AWS using Lambda and EFS.

I have mounted EFS on the Lambda and get the following error when I try to load the model:

terminate called after throwing an instance of 'std::bad_alloc'

Here is the Stackoverflow issue I have created - https://stackoverflow.com/questions/63817981/terminate-called-after-throwing-an-instance-of-stdbad-alloc-in-aws-efs-while?noredirect=1#comment112852352_63817981

Can someone guide me as to what is happening? Is this due to the 3 GB RAM limitation on Lambda?

If that is so, is the only option for deploying a model that uses BioSentVec to use an EC2 instance?

PubMed corpus only trained model for BioSentVec

Hi all, thanks for providing the embeddings for the BioSentVec model 21GB (700dim, trained on PubMed+MIMIC-III). Awesome work that has helped me a lot!

I'd like to have the BioSentVec model trained only on the PubMed corpus. Did you train such a model too, or only the combined model on PubMed+MIMIC-III?

I have tried the following model: BioSentVec model 21GB (700dim, trained on PubMed+MIMIC-III), but my test dataset contains data from MIMIC-III, so I have a train/test overlap.

I would appreciate an answer from you.

BIOSSES and MedSTS results

Hi
Interesting paper and approach; however, I am somewhat confused about how to reproduce the results on both datasets. More importantly, the paper mentions (as far as I understand) using a 5-layer deep neural network trained on the embeddings generated by BioSentVec. Isn't the dataset size too small for deep networks, and is it possible to share the training code?

MedSTS dataset

Hi

Where can I find the MedSTS dataset? Also, do I need any pre-processing to evaluate the dataset with the BioSentVec model?

Thanks
Abhishek

Medical antonyms

The following is not a bug in your code; rather, I am wondering if anyone has thoughts on it.
I'm working on some NLP tasks in oncology.
I have found that randomly initialized word/sentence embeddings tend to work better than any pretrained embeddings for ultimately classifying, say, improving vs. worsening cancer.
I had an intuition that this might be because key words for telling the two apart tend to be embedded similarly.

In trying BioSentVec, this seems to be borne out, e.g.:

from scipy.spatial import distance

# model and preprocess_sentence as in the BioSentVec tutorial
progression = model.embed_sentence(preprocess_sentence("Increase in size of tumor"))
response = model.embed_sentence(preprocess_sentence("Decrease in size of tumor"))
print(1 - distance.cosine(progression, response))

This yields 0.94: opposite meanings are embedded similarly, which explains why building classifiers on these embeddings does not work well.

Are there any methods for addressing this in the transfer-learning setting that you're aware of? I have not found any.

How can I load the BioSentVec model with limited RAM?

I want to use the pretrained BioSentVec model on my local machine, which has almost 12 GB of RAM, while the model file is almost 22 GB. When I try to load the model using sent2vec, it gets stuck and returns nothing; the reason is probably that the entire model is being loaded and there is not enough memory. Is there any way to load the model with under 12 GB of RAM?

MedSTS

Where can I find the dataset?

"Model file cannot be opened for loading!" for BioSentVec

I have Python 3.8.5 and installed fasttext as well as sent2vec. But when I try to load the model, Python crashes with the single message "Model file cannot be opened for loading!". The full installation code is:

conda create -n sent2vec "python==3.8.5"
conda activate sent2vec
pip install Cython
git clone https://github.com/facebookresearch/fastText.git
git clone https://github.com/epfml/sent2vec.git
cd fastText
pip install .
cd ../sent2vec
pip install .

Then, the Python code is:

import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model("./BioSentVec_PubMed_MIMICIII-bigram_d700.bin")

What could be wrong?

Unable to handle negation of sentences

When I calculated the similarity between "disease causing" and "not disease causing" using BioSentVec, it gave 1, but I think it should give a value close to zero. Kindly have a look at the following:

from scipy.spatial import distance

# model and preprocess_sentence as in the BioSentVec tutorial
sentence_vector1 = model.embed_sentence(preprocess_sentence("disease causing"))
sentence_vector2 = model.embed_sentence(preprocess_sentence("not disease causing"))
cosine_sim = 1 - distance.cosine(sentence_vector1, sentence_vector2)
print(cosine_sim)  # this prints 1

Make source corpora available

Would it be possible for you to make your source corpora available, both raw and preprocessed (tokenized / sentence-split)? It would be very useful in helping folks create resources with other methods.

Context Vectors for words

Hello,

I want to use context embeddings for words to perform word-similarity tasks. Is there a way to get the context vectors for words using the fastText model file?

Thanks,
Aditya
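
(For what it's worth, recent fasttext Python bindings expose the model's matrices, so context ("output") vectors for in-vocabulary words could be read roughly as below. This is a hedged sketch, assuming that output-matrix rows align with dictionary word ids in unsupervised models.)

import fasttext

model = fasttext.load_model("BioWordVec_PubMed_MIMICIII_d200.bin")
out_matrix = model.get_output_matrix()  # shape: (num_words, dim)

def context_vector(word):
    idx = model.get_word_id(word)  # -1 if the word is not in the dictionary
    if idx < 0:
        raise KeyError(f"{word!r} is out of vocabulary")
    return out_matrix[idx]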

Invalid words in vocabulary?

While exploring nearest neighbors, I have seen many words that seem to be invalid.

import fasttext
model = fasttext.load_model('./BioSentVec/models/BioWordVec_PubMed_MIMICIII_d200.bin')
model.get_nearest_neighbors('kidney', 30)

This gives the following output:

[(0.9160109162330627, u'kidney*'),
(0.9024562239646912, u'kidney=='),
(0.8989526033401489, u'kidneyks'),
(0.8850656747817993, u'kidney=5'),
(0.8817461133003235, u'kidney-kidney'),
(0.878646731376648, u'1kidney'),
(0.8774275183677673, u'2kidney'),
(0.87574702501297, u'kidney.6'),
(0.8753364682197571, u'qkidney'),
(0.8732652068138123, u'vkidney'),
(0.8732365369796753, u'kidney=48'),
(0.8726592659950256, u'kidneyx2'),
(0.8723018765449524, u'kidney.7'),
(0.8717607259750366, u'kidney2'),
(0.8697896003723145, u'kidneyys'),
(0.8693934679031372, u'kidneyl'),
(0.8692406415939331, u'kidneyds'),
(0.8683575987815857, u'lkidney'),
(0.8680075407028198, u'kidney*liver'),
(0.8666210174560547, u'kidney.5'),
(0.866155207157135, u'e1kidney'),
(0.8647593855857849, u'ckidney'),
(0.8646546006202698, u'ekidney'),
(0.8636531233787537, u'kidney~the'),
(0.861596941947937, u'kidney.2'),
(0.8599884510040283, u'dkidney'),
(0.8594629764556885, u'kidney.3'),
(0.8585801124572754, u'=kidney'),
(0.8581786155700684, u'vtkidney'),
(0.858029305934906, u'kidneywith')]

I wonder what these words are.
Are they coming from acupuncture points?
E.g., do kidney2 and kidney.2 represent http://www.acupuncture.com/education/points/kidney/kid2.htm ?

  • Even if that's the case, is it correct to generate words from the phrase kidney 2?
  • Or was the pre-processing not done properly?

But when I use fastText's official model, it returns the expected nearest words:

model = fasttext.load_model('./models/cc.en.300.bin')
model.get_nearest_neighbors('kidney', 30)
[(0.7705090045928955, u'renal'),
(0.7571945786476135, u'kidneys'),
(0.7136564254760742, u'Kidney'),
(0.6960737109184265, u'kindey'),
(0.6932832598686218, u'liver'),
(0.6215611100196838, u'gallbladder'),
(0.6096128225326538, u'kidney-'),
(0.592450737953186, u'kidney-related'),
(0.5883890390396118, u'lung'),
(0.5875317454338074, u'Renal'),
(0.5851610898971558, u'kidney.'),
(0.580848753452301, u'dialysis'),
(0.5669795870780945, u'Kidneys'),
(0.565768301486969, u'pre-renal'),
(0.5617753267288208, u'hydronephrotic'),
(0.5602078437805176, u'non-renal'),
(0.5586943030357361, u'extra-renal'),
(0.557516872882843, u'ureter'),
(0.5568706393241882, u'hydronephrosis'),
(0.5556935667991638, u'nephrosis'),
(0.5507169961929321, u'extrarenal'),
(0.5478389859199524, u'bladder'),
(0.5455406904220581, u'nephritis'),
(0.540409505367279, u'pancreas'),
(0.538938045501709, u'gall-bladder'),
(0.5365235805511475, u'TEENney'),
(0.5338416695594788, u'pancreatic'),
(0.5323835611343384, u'ureteric'),
(0.5321975946426392, u'glomerular'),
(0.5308919548988342, u'prerenal')]

find similar sentences

Would it be possible to find similar sentences using the sent2vec model?
How can I use the BioSentVec model to query for similar sentences?
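
(One possible approach, sketched under the assumption that the corpus fits in memory: embed all candidate sentences once, then rank by cosine similarity to the query. The corpus below is a placeholder.)

import numpy as np
import sent2vec
from scipy.spatial import distance

model = sent2vec.Sent2vecModel()
model.load_model("BioSentVec_PubMed_MIMICIII-bigram_d700.bin")

corpus = ["the patient denies chest pain .",
          "renal function remained stable ."]  # placeholder sentences
corpus_vecs = model.embed_sentences(corpus)    # shape: (n_sentences, 700)

def most_similar(query, k=5):
    q = model.embed_sentence(query)[0]
    sims = [1 - distance.cosine(q, v) for v in corpus_vecs]
    order = np.argsort(sims)[::-1][:k]
    return [(corpus[i], sims[i]) for i in order]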

Question on calculating the similarity of UMNSRS

Thank you for the great embeddings. I have a few questions on how to calculate the similarity of UMNSRS.

Per my understanding, BioWordVec is a word embedding: each word is represented as a vector. However, some phrases in UMNSRS contain more than one word. Did you average the vectors of the words in each phrase and then calculate the cosine similarity?

Another question is how you deal with words that are not in the vocabulary. E.g., I found that:

(ana)
arthriits
buterfly
varicsoe
haletosis

are not in the vocabulary. Did you impute something, or just discard those terms?

One more question: I found that the window size in the .sh script is 30, while you describe using 20 for the extrinsic task. Which one yields a better result?

Thank you, and I look forward to your reply!

Add generation code

It's really great that you have provided pretrained embeddings. For completeness, please also add the code written to generate them. It will serve as a useful technical example for someone to improve upon. Thanks.

Question: when you specify limit, does it start with the most frequently found words?

When you load it like this, will these be the top 4E5 words found during training? I believe other vector bins, like Google News, work like this. Thank you!

import gensim

word2vec = gensim.models.KeyedVectors.load_word2vec_format(
    'data/BioWordVec_PubMed_MIMICIII_d200.vec.bin',
    binary=True,
    limit=int(4E5),  # faster load if you limit to the most frequent terms?
)

Months/Year of the PubMed corpus?

Hi,
Thanks for this very valuable resource. I would like to know the month/year in which the PubMed corpus used to train the BioWordVec models was downloaded. That is, articles in PubMed up to what day/month/year were used to build the BioWordVec models?

Thank you,
Mani

Links to vec/model file wrong?

Hi,

thank you for sharing this with the rest of us! It's already coming in handy ;)

Just a minor issue (and I'm not sure it is one): did you by mistake attach the "opposite" files to the respective links?

BioWordVec vector 13GB (200dim, trained on PubMed+MIMIC-III, word2vec bin format) -> downloads the bin file, size 27GB

BioWordVec model 26GB (200dim, trained on PubMed+MIMIC-III) -> downloads the vec.bin file, size 13GB

Best,
J.

How to import sent2vec

I want to use the pretrained BioSentVec model to extract sentence vectors. I am following the code below and ran into the error "no module named 'sent2vec'". Do you know how to resolve this?

import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .")

I have done the following steps:
