
BioWordVec & BioSentVec:
pre-trained embeddings for biomedical words and sentences

Text corpora

We created biomedical word and sentence embeddings using PubMed and clinical notes from the MIMIC-III Clinical Database. Both the PubMed and MIMIC-III texts were split into sentences and tokenized using NLTK, and all words were lowercased. The statistics of the two corpora are shown below.

Sources                   Documents   Sentences    Tokens
PubMed                    28,714,373  181,634,210  4,354,171,148
MIMIC-III clinical notes  2,083,180   41,674,775   539,006,967
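
The preprocessing described above could be reproduced roughly as follows. This is a minimal sketch, assuming NLTK's default sentence splitter and word tokenizer; the exact tokenizer settings used by the authors are not stated here.

from nltk import sent_tokenize, word_tokenize  # requires nltk.download('punkt')

def preprocess(document):
    # Split into sentences, tokenize, and lowercase: one sentence per line.
    for sentence in sent_tokenize(document):
        yield " ".join(word_tokenize(sentence)).lower()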

We applied fastText to compute 200-dimensional word embeddings. We set the window size to 20, the learning rate to 0.05, the sampling threshold to 1e-4, and the number of negative examples to 10. Both the word vectors and the model with hyperparameters are available for download below. The model file can be used to compute vectors for words that are not in the dictionary (i.e., out-of-vocabulary terms). This work extends the original BioWordVec, which provides fastText word embeddings trained on PubMed and MeSH. We used the same parameters as the original BioWordVec, which has been thoroughly evaluated in a range of applications.
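
For illustration, comparable word vectors could be trained with the fasttext Python bindings as sketched below. The corpus path is hypothetical, and the skip-gram variant is an assumption (the text above does not name the fastText model type); only the stated hyperparameters are taken from the text.

import fasttext

model = fasttext.train_unsupervised(
    "pubmed_mimic_sentences.txt",  # hypothetical: one tokenized, lowercased sentence per line
    model="skipgram",              # assumption; the variant is not stated above
    dim=200,                       # 200-dimensional vectors
    ws=20,                         # window size
    lr=0.05,                       # learning rate
    t=1e-4,                        # sampling threshold
    neg=10,                        # negative examples
)
model.save_model("biowordvec_sketch.bin")  # the .bin model can embed out-of-vocabulary terms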

We evaluated BioWordVec for medical word pair similarity. We used the MayoSRS (101 medical term pairs; download here) and UMNSRS_similarity (566 UMLS concept pairs; download here) datasets.

Model             MayoSRS  UMNSRS_similarity
word2vec          0.513    0.626
BioWordVec model  0.552    0.660
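
A word-pair evaluation along these lines computes cosine similarity between the term vectors and correlates it with the human ratings. This is a hedged sketch: the pairs list is a placeholder, and whether the table reports Spearman or Pearson correlation is not stated above.

import fasttext
from scipy.spatial import distance
from scipy.stats import spearmanr

model = fasttext.load_model("BioWordVec_PubMed_MIMICIII_d200.bin")

def evaluate(pairs):
    # pairs: list of (term1, term2, human_score) tuples from MayoSRS or UMNSRS.
    predicted = [1 - distance.cosine(model.get_word_vector(a), model.get_word_vector(b))
                 for a, b, _ in pairs]
    gold = [score for _, _, score in pairs]
    return spearmanr(predicted, gold)[0]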

We applied sent2vec to compute 700-dimensional sentence embeddings. We used the bigram model and set the window size to 20 and the number of negative examples to 10.

We evaluated BioSentVec for clinical sentence pair similarity tasks. We used the BIOSSES (100 sentence pairs; download here) and the MedSTS (1068 sentence pairs; download here) datasets.

Model                                               BIOSSES  MedSTS
Unsupervised methods
    doc2vec                                         0.787    -
    Levenshtein Distance                            -        0.680
    Averaged word embeddings                        0.694    0.747
    Universal Sentence Encoder                      0.345    0.714
    BioSentVec (PubMed)                             0.817    0.750
    BioSentVec (MIMIC-III)                          0.350    0.759
    BioSentVec (PubMed + MIMIC-III)                 0.795    0.767
Supervised methods
    Linear Regression                               0.836    -
    Random Forest                                   -        0.818
    Deep learning + Averaged word embeddings        0.703    0.784
    Deep learning + Universal Sentence Encoder      0.401    0.774
    Deep learning + BioSentVec (PubMed)             0.824    0.819
    Deep learning + BioSentVec (MIMIC-III)          0.353    0.805
    Deep learning + BioSentVec (PubMed + MIMIC-III) 0.848    0.836
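
The unsupervised BioSentVec rows above correspond to cosine similarity between sentence embeddings, correlated against the gold scores. A minimal sketch, assuming the released model file name and Pearson correlation (the correlation measure is not stated in this table):

import sent2vec
from scipy.spatial import distance
from scipy.stats import pearsonr

model = sent2vec.Sent2vecModel()
model.load_model("BioSentVec_PubMed_MIMICIII-bigram_d700.bin")

def evaluate(pairs):
    # pairs: list of (sentence1, sentence2, gold_score) tuples from BIOSSES or MedSTS.
    predicted = [1 - distance.cosine(model.embed_sentence(s1)[0],
                                     model.embed_sentence(s2)[0])
                 for s1, s2, _ in pairs]
    gold = [g for _, _, g in pairs]
    return pearsonr(predicted, gold)[0]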

FAQ

You can find answers to frequently asked questions on our Wiki; e.g., you can find the instructions on how to load these models.

There is also a tutorial on how to use BioSentVec for a quick start.
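
For reference, a minimal loading sketch (file names are those of the released models; input sentences are assumed to be tokenized and lowercased as described above):

import fasttext
import sent2vec

# Sentence embeddings (700-dim).
sent_model = sent2vec.Sent2vecModel()
sent_model.load_model("BioSentVec_PubMed_MIMICIII-bigram_d700.bin")
vec = sent_model.embed_sentence("breast cancer is the most common cancer in women .")

# Word embeddings (200-dim), including out-of-vocabulary terms via subwords.
word_model = fasttext.load_model("BioWordVec_PubMed_MIMICIII_d200.bin")
wv = word_model.get_word_vector("kidney")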

References

When using some of our pre-trained models for your application, please cite the following papers:

  1. Zhang Y, Chen Q, Yang Z, Lin H, Lu Z. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data. 2019.
  2. Chen Q, Peng Y, Lu Z. BioSentVec: creating sentence embeddings for biomedical texts. The 7th IEEE International Conference on Healthcare Informatics. 2019.

Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine. We are grateful to the authors of fastText, sent2vec, MayoSRS, UMNSRS, BIOSSES, and MedSTS for making their software and data publicly available.


biosentvec's Issues

gensim - read bin file?

Hi everyone,

thank you for making this resource public; it is a great help to the community.

I am having issues loading the .bin (model) file with gensim. Code snippet follows:

from gensim.models.fasttext import load_facebook_vectors

gensim_fasttext_model = load_facebook_vectors(
    root_path + "models/pretrained/BioWordVec_PubMed_MIMICIII_d200.bin")

Although the above snippet consumes CPU and RAM, it never finishes loading the model (it appears to load indefinitely), nor does it produce an error.

When I load it with the fastText library instead, it loads within about 90 seconds. Code snippet:

import fastText as fasttext
fasttext.load_model(root_path+'models/pretrained/BioWordVec_PubMed_MIMICIII_d200.bin')

Unfortunately, I would prefer to use the gensim approach, as it supports item access to generate representations (e.g. model['word_to_represent']).

I can load the vec.bin file (the pretrained word-embedding mapping) with

from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('BioWordVec_PubMed_MIMICIII_d200.vec.bin', binary=True)

but that does not help in my current pipeline (as it does not handle OOVs).

Could you provide any guidance on what might be wrong? My environment is:

Linux-3.10.0-693.5.2.el7.x86_64-x86_64-with-centos-7.6.1810-Core
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]
NumPy 1.16.2
SciPy 1.1.0
gensim 3.7.2
FAST_VERSION 1

BioWordVec - how to handle phrases

The MayoSRS and UMNSRS_similarity datasets mostly contain phrases. Did you use mean pooling to get the results you reported, or some other pooling mechanism for n-grams longer than 1?

Thanks
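
(For illustration, mean pooling over the fastText word vectors would look like the sketch below; whether the authors pooled this way is exactly the open question above, so this is only an assumption.)

import numpy as np
import fasttext

model = fasttext.load_model("BioWordVec_PubMed_MIMICIII_d200.bin")

def phrase_vector(phrase):
    # Average the word vectors of the lowercased tokens in the phrase.
    vectors = [model.get_word_vector(tok) for tok in phrase.lower().split()]
    return np.mean(vectors, axis=0)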

name 'stdvector_base' is not defined when calling sent2vec.Sent2vecModel.embed_sentence()

NameError                                 Traceback (most recent call last)
<ipython-input> in <module>
      3 for line in pos:
      4     print(line)
----> 5     sentence_vector = model.embed_sentence(line)
      6     pos_arrays[i] = sentence_vector
      7     pos_labels[i] = 1

src/sent2vec.pyx in sent2vec.Sent2vecModel.embed_sentence()

src/sent2vec.pyx in sent2vec.Sent2vecModel.embed_sentences()

src/sent2vec.pyx in sent2vec.vector_wrapper.asarray()

NameError: name 'stdvector_base' is not defined

I looked at the code of stdvector_base and only see "pass" in the class definition. I added a default constructor, but that did not resolve the issue. Any suggestions? This April I could run it successfully, but it no longer works.

terminate called after throwing an instance of 'std::bad_alloc' in AWS EFS while loading BioSentVec

I built a TensorFlow model on top of BioSentVec embeddings. Now that I am trying to deploy the model, I need BioSentVec at inference time to preprocess the inputs.

I am trying to deploy the model on AWS using Lambda and EFS.

I have mounted EFS on the Lambda and get the following error when I try to load the model:

terminate called after throwing an instance of 'std::bad_alloc'

Here is the Stackoverflow issue I have created - https://stackoverflow.com/questions/63817981/terminate-called-after-throwing-an-instance-of-stdbad-alloc-in-aws-efs-while?noredirect=1#comment112852352_63817981

Can someone guide me as to what is happening? Is this due to the 3 GB RAM limitation on Lambda?

If that is so, is the only option for deploying a model that uses BioSentVec to use an EC2 instance?

PubMed corpus only trained model for BioSentVec

Hi all, thanks for providing the embeddings for the BioSentVec model 21GB (700dim, trained on PubMed+MIMIC-III). Awesome work that has helped me a lot!

I'd like to have the BioSentVec model trained only on the PubMed corpus. Did you train such a model too, or only the combined model on PubMed+MIMIC-III?

I have tried the following model: BioSentVec model 21GB (700dim, trained on PubMed+MIMIC-III), but my test dataset contains data from MIMIC-III, so I have a train/test overlap.

I would appreciate an answer from you.

BIOSSES and MedSTS results

Hi
Interesting paper and approach; however, I am somewhat confused about how to reproduce the results on both datasets. More importantly, the paper mentions (as far as I understand) using a 5-layer deep neural network trained on the embeddings generated by BioSentVec. Isn't the dataset size too small for deep networks, and is it possible to share the training code?

MedSTS dataset

Hi

Where can I find the MedSTS dataset? Also, do I need any pre-processing to evaluate the dataset with the BioSentVec model?

Thanks
Abhishek

Medical antonyms

The following is not a bug in your code; rather, I am wondering if anyone has thoughts on it.
I'm working on some NLP tasks in oncology.
I have found that randomly initialized word/sentence embeddings tend to work better than any pretrained embeddings for ultimately classifying, say, improving vs. worsening cancer.
I had an intuition that this might be because key words for telling the two apart tend to be embedded similarly.

In trying BioSentVec, this seems to be borne out, e.g.:

from scipy.spatial import distance

# model and preprocess_sentence as in the BioSentVec tutorial
progression = model.embed_sentence(preprocess_sentence("Increase in size of tumor"))
response = model.embed_sentence(preprocess_sentence("Decrease in size of tumor"))
print(1 - distance.cosine(progression, response))

This yields 0.94: opposite meanings are embedded similarly, which explains why building classifiers on these embeddings does not work well.

Are there any methods for addressing this in the transfer-learning setting that you're aware of? I have not found any.

How can I load the BioSentVec model with limited RAM?

I want to use the pretrained BioSentVec model on my local machine, which has almost 12 GB of RAM, while the model file is almost 22 GB. When I try to load the model using sent2vec, it gets stuck and returns nothing; the reason is probably that the entire model is being loaded and there is not enough memory. Is there any way to load the model with under 12 GB of RAM?

MedSTS

Where can I find the dataset?

"Model file cannot be opened for loading!" for BioSentVec

I have Python 3.8.5 and installed fasttext as well as sent2vec. But when I try to load the model, Python crashes with the single message "Model file cannot be opened for loading!". The full installation code is:

conda create -n sent2vec "python==3.8.5"
conda activate sent2vec
pip install Cython
git clone https://github.com/facebookresearch/fastText.git
git clone https://github.com/epfml/sent2vec.git
cd fastText
pip install .
cd ../sent2vec
pip install .

Then, the Python code is:

import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model("./BioSentVec_PubMed_MIMICIII-bigram_d700.bin")

What could be wrong?

Unable to handle negation of sentences

When I calculated the similarity between "disease causing" and "not disease causing" using BioSentVec, it gave 1, but I think it should give a value close to zero. Kindly have a look at the following:

from scipy.spatial import distance

# model and preprocess_sentence as in the BioSentVec tutorial
sentence_vector1 = model.embed_sentence(preprocess_sentence("disease causing"))
sentence_vector2 = model.embed_sentence(preprocess_sentence("not disease causing"))
cosine_sim = 1 - distance.cosine(sentence_vector1, sentence_vector2)
print(cosine_sim)  # this prints 1

Make source corpora available

Would it be possible for you to make your source corpora available, both raw and preprocessed (tokenized / sentence-split)? It would be very useful in helping folks create resources with other methods.

Context Vectors for words

Hello,

I want to use context embeddings for words to perform word-similarity tasks. Is there a way to get the context vectors for words using the fastText model file?

Thanks,
Aditya
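
(For what it's worth, recent fasttext Python bindings expose the model's matrices, so context ("output") vectors for in-vocabulary words could be read roughly as below. This is a hedged sketch, assuming that output-matrix rows align with dictionary word ids in unsupervised models.)

import fasttext

model = fasttext.load_model("BioWordVec_PubMed_MIMICIII_d200.bin")
out_matrix = model.get_output_matrix()  # shape: (num_words, dim)

def context_vector(word):
    idx = model.get_word_id(word)  # -1 if the word is not in the dictionary
    if idx < 0:
        raise KeyError(f"{word!r} is out of vocabulary")
    return out_matrix[idx]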

Invalid words in vocabulary?

While exploring nearest neighbors, I have seen many words that seem to be invalid.

import fasttext
model = fasttext.load_model('./BioSentVec/models/BioWordVec_PubMed_MIMICIII_d200.bin')
model.get_nearest_neighbors('kidney', 30)

This gives the following output:

[(0.9160109162330627, u'kidney*'),
(0.9024562239646912, u'kidney=='),
(0.8989526033401489, u'kidneyks'),
(0.8850656747817993, u'kidney=5'),
(0.8817461133003235, u'kidney-kidney'),
(0.878646731376648, u'1kidney'),
(0.8774275183677673, u'2kidney'),
(0.87574702501297, u'kidney.6'),
(0.8753364682197571, u'qkidney'),
(0.8732652068138123, u'vkidney'),
(0.8732365369796753, u'kidney=48'),
(0.8726592659950256, u'kidneyx2'),
(0.8723018765449524, u'kidney.7'),
(0.8717607259750366, u'kidney2'),
(0.8697896003723145, u'kidneyys'),
(0.8693934679031372, u'kidneyl'),
(0.8692406415939331, u'kidneyds'),
(0.8683575987815857, u'lkidney'),
(0.8680075407028198, u'kidney*liver'),
(0.8666210174560547, u'kidney.5'),
(0.866155207157135, u'e1kidney'),
(0.8647593855857849, u'ckidney'),
(0.8646546006202698, u'ekidney'),
(0.8636531233787537, u'kidney~the'),
(0.861596941947937, u'kidney.2'),
(0.8599884510040283, u'dkidney'),
(0.8594629764556885, u'kidney.3'),
(0.8585801124572754, u'=kidney'),
(0.8581786155700684, u'vtkidney'),
(0.858029305934906, u'kidneywith')]

I wonder what these words are.
Are they coming from acupuncture points?
E.g., do kidney2 and kidney.2 represent http://www.acupuncture.com/education/points/kidney/kid2.htm ?

  • Even if that's the case, is it correct to generate words from the phrase kidney 2?
  • Or was the pre-processing not done properly?

But when I use fastText's official model, it returns the expected nearest words:

model = fasttext.load_model('./models/cc.en.300.bin')
model.get_nearest_neighbors('kidney', 30)
[(0.7705090045928955, u'renal'),
(0.7571945786476135, u'kidneys'),
(0.7136564254760742, u'Kidney'),
(0.6960737109184265, u'kindey'),
(0.6932832598686218, u'liver'),
(0.6215611100196838, u'gallbladder'),
(0.6096128225326538, u'kidney-'),
(0.592450737953186, u'kidney-related'),
(0.5883890390396118, u'lung'),
(0.5875317454338074, u'Renal'),
(0.5851610898971558, u'kidney.'),
(0.580848753452301, u'dialysis'),
(0.5669795870780945, u'Kidneys'),
(0.565768301486969, u'pre-renal'),
(0.5617753267288208, u'hydronephrotic'),
(0.5602078437805176, u'non-renal'),
(0.5586943030357361, u'extra-renal'),
(0.557516872882843, u'ureter'),
(0.5568706393241882, u'hydronephrosis'),
(0.5556935667991638, u'nephrosis'),
(0.5507169961929321, u'extrarenal'),
(0.5478389859199524, u'bladder'),
(0.5455406904220581, u'nephritis'),
(0.540409505367279, u'pancreas'),
(0.538938045501709, u'gall-bladder'),
(0.5365235805511475, u'TEENney'),
(0.5338416695594788, u'pancreatic'),
(0.5323835611343384, u'ureteric'),
(0.5321975946426392, u'glomerular'),
(0.5308919548988342, u'prerenal')]

find similar sentences

Would it be possible to find similar sentences using the sent2vec model?
How can I use the BioSentVec model to query for similar sentences?
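
(One possible approach, sketched under the assumption that the corpus fits in memory: embed all candidate sentences once, then rank by cosine similarity to the query. The corpus below is a placeholder.)

import numpy as np
import sent2vec
from scipy.spatial import distance

model = sent2vec.Sent2vecModel()
model.load_model("BioSentVec_PubMed_MIMICIII-bigram_d700.bin")

corpus = ["the patient denies chest pain .",
          "renal function remained stable ."]  # placeholder sentences
corpus_vecs = model.embed_sentences(corpus)    # shape: (n_sentences, 700)

def most_similar(query, k=5):
    q = model.embed_sentence(query)[0]
    sims = [1 - distance.cosine(q, v) for v in corpus_vecs]
    order = np.argsort(sims)[::-1][:k]
    return [(corpus[i], sims[i]) for i in order]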

Question on calculating the similarity of UMNSRS

Thank you for the great embeddings. I have a few questions on how to calculate the similarity of UMNSRS.

Per my understanding, BioWordVec is a word embedding: each word is represented as a vector. However, some phrases in UMNSRS contain more than one word. Did you average the vectors of the words in each phrase and then calculate the cosine similarity?

Another question is how you deal with words that are not in the vocabulary. E.g., I found that:

(ana)
arthriits
buterfly
varicsoe
haletosis

are not in the vocabulary. Did you impute something, or just discard those terms?

One more question: I found that the window size in the .sh script is 30, while you describe using 20 for the extrinsic task. Which one yields a better result?

Thank you, and I look forward to your reply!

Add generation code

It's really great that you have provided pretrained embeddings. For completeness, please also add the code written to generate them. It will serve as a useful technical example for someone to improve upon. Thanks.

Question: when you specify limit, does it start with the most frequently found words?

When you load it like this, will these be the top 4E5 words found during training? I believe other vector bins, like Google News, work like this. Thank you!

import gensim

word2vec = gensim.models.KeyedVectors.load_word2vec_format(
    'data/BioWordVec_PubMed_MIMICIII_d200.vec.bin',
    binary=True,
    limit=int(4E5),  # faster load if you limit to the most frequent terms?
)

Months/Year of the PubMed corpus?

Hi,
Thanks for this very valuable resource. I would like to know the month/year in which the PubMed corpus used to train the BioWordVec models was downloaded. That is, articles in PubMed up to what day/month/year were used to build the BioWordVec models?

Thank you,
Mani

Links to vec/model file wrong?

Hi,

thank you for sharing this with the rest of us! It's already coming in handy ;)

Just a minor issue (and I'm not sure it is one): did you by mistake attach the "opposite" files to the respective links?

BioWordVec vector 13GB (200dim, trained on PubMed+MIMIC-III, word2vec bin format) -> downloads the bin file, size 27GB

BioWordVec model 26GB (200dim, trained on PubMed+MIMIC-III) -> downloads the vec.bin file, size 13GB

Best,
J.

How to import sent2vec

I want to use the pretrained BioSentVec model to extract sentence vectors. I am following the code below and ran into the error "no module named 'sent2vec'". Do you know how to resolve this?

import sent2vec
model = sent2vec.Sent2vecModel()
model.load_model('model.bin')
emb = model.embed_sentence("once upon a time .")

I have done the following steps:
