
sentence-transformers's People

Contributors

andrewkittredge, aphedges, arzelaascoii, bmaz, cpcdoy, fkdosilovic, fpgmaas, fros1y, guotong1988, josemarcosrf, kddubey, kwang2049, mauricesvp, michaelfeil, milistu, nikitajz, nimaboscarino, nreimers, omarespejel, osanseviero, philipmay, quetzalcohuatl, rafaelwo, sadakmed, sidhantls, sugatoray, swarajban, tomaarsen, zhenghongming888, zoltan-fedor


sentence-transformers's Issues

Range of CosineSimilarity [-1, 1] does not match the label range (0, 1)

I see in the code that for textual semantic similarity tasks like STS, CosineSimilarityLoss first computes the cosine similarity between the two vectors and then uses mean squared error to compute the loss.
However, the range of cosine similarity is [-1, 1] (<0: not similar, >0: similar) while the labels are in (0, 1), so it seems that during training the model tries to map the space [-1, 1] onto the space [0, 1]. At prediction time, >0.5 means similar and <0.5 not similar.
I'm not sure whether my understanding is correct.
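For readers hitting the same question, here is a minimal sketch of the setup being described (cosine similarity in [-1, 1] regressed with MSE against labels in [0, 1]); the random tensors are stand-ins, not the library's actual implementation:

import torch

emb_a = torch.randn(8, 768)       # embeddings of the first sentences
emb_b = torch.randn(8, 768)       # embeddings of the second sentences
labels = torch.rand(8)            # e.g. STS gold scores normalized to [0, 1]

cos_sim = torch.cosine_similarity(emb_a, emb_b)        # values in [-1, 1]
loss = torch.nn.functional.mse_loss(cos_sim, labels)   # MSE pushes the predictions toward the [0, 1] labels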

Using sentence transformers for transforming words with word-windows?

I've written an Extractive Summarizer called CX_DB8 which utilizes pretrained word-embedding models to summarize/semantically-search documents. It works at the word, sentence or paragraph level, and supports any pretrained model available with pytorch-transformers or offered via the Flair AI package.

My question is this: is sentence-transformers suitable for training / fine-tuning with, say, 10-word sliding word-windows? What about paragraph-sized texts? Are the pretrained models offered here suitable for running word-windows through without any fine-tuning? What do you think about combining these sentence/word-window embeddings with the PageRank algorithm?
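In case it helps frame the question, encoding sliding word-windows with a pretrained model works mechanically just like encoding sentences; a small sketch (the window size and stride are arbitrary choices here, not recommendations):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')

words = "the quick brown fox jumps over the lazy dog again and again".split()
window_size, stride = 10, 2
windows = [" ".join(words[i:i + window_size])
           for i in range(0, max(1, len(words) - window_size + 1), stride)]

window_embeddings = model.encode(windows)   # one embedding per word-window, no fine-tuning needed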

Question about softmaxloss

Hi,
This is a wonderful project, but I have a question about the softmax loss. If you train with softmax loss, then at evaluation time you still need the classifier layer. That means that finding the pair with the highest similarity in a collection of n = 10,000 sentences still requires n·(n−1)/2 = 49,995,000 inference computations, right? Because you still need the weights of the classifier layer.
I believe that in order to use cosine similarity to evaluate the model, you have to train with the cosine similarity loss. Is it possible to evaluate the model with cosine similarity after training with the softmax loss?
Thanks
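To make the counting in the question concrete: once sentences are mapped to fixed-size embeddings, only n encoder passes are needed, and the n·(n−1)/2 similarities reduce to cheap vector operations with no classifier head involved. A hedged sketch:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')   # a model trained with softmax loss on NLI

sentences = ["A man is eating food.",
             "A man is eating a piece of bread.",
             "The girl is carrying a baby."]
emb = np.asarray(model.encode(sentences))                  # n encoder passes
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)     # L2-normalize
cosine_scores = emb @ emb.T                                # all pairwise cosine similarities at once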

Training 2 distinct models to map different Inputs to Shared Vector Space

Hi -
As I understand it, you choose to use the same model to encode both sentences in, let's say, the NLI task. From the cosine loss implementation:

def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor):
    reps = [self.model(sentence_feature)['sentence_embedding'] for sentence_feature in sentence_features]

In this case, since the first sentence represents a premise and the latter a hypothesis, would it make sense to train two separate models simultaneously that map onto the same vector space?

Is this already what's happening and I'm just confused? Or is this something you've already explored?

For example something like:

def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor):
    rep_a = self.model_a(sentence_features[0])['sentence_embedding']
    rep_b = self.model_b(sentence_features[1])['sentence_embedding']
    output = torch.cosine_similarity(rep_a, rep_b)

Thank you
Jack

Question: Why train from scratch?

First, thank you very much for your work on sentence-transformers. I can't wait to start using it and really digging into it to understand everything.

I have a very naive question. Why would someone want to train a model from scratch? I can see fine-tuning a pre-trained model on a dataset that is, say, more representative of the vocabulary for your task. But what would be the various reasons for wanting to train, say, bert-base-uncased from scratch?

RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.IntTensor instead

Everything was fine before I reinstalled Anaconda to the latest version (2019.07) (or it could be the update to 0.2.2?), but now when I run the code it gives this RuntimeError:

File "D:\anaconda\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 136, in encode
embeddings = self.forward(features)
File "D:\anaconda\lib\site-packages\torch\nn\modules\container.py", line 92, in forward
input = module(input)
File "D:\anaconda\lib\site-packages\torch\nn\modules\module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "D:\anaconda\lib\site-packages\sentence_transformers\models\BERT.py", line 27, in forward
output_tokens = self.bert(input_ids=features['input_ids'], token_type_ids=features['token_type_ids'], attention_mask=features['input_mask'])[0]
File "D:\anaconda\lib\site-packages\torch\nn\modules\module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "D:\anaconda\lib\site-packages\pytorch_transformers\modeling_bert.py", line 712, in forward
embedding_output = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
File "D:\anaconda\lib\site-packages\torch\nn\modules\module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "D:\anaconda\lib\site-packages\pytorch_transformers\modeling_bert.py", line 264, in forward
words_embeddings = self.word_embeddings(input_ids)
File "D:\anaconda\lib\site-packages\torch\nn\modules\module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "D:\anaconda\lib\site-packages\torch\nn\modules\sparse.py", line 118, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "D:\anaconda\lib\site-packages\torch\nn\functional.py", line 1454, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.IntTensor instead (while checking arguments for embedding)

I've reinstalled pytorch and pytorch-transformers, but it doesn't help. Here is my simple code:

model = SentenceTransformer("./model/")
sentences = ['This framework generates embeddings for each input sentence',
'Sentences are passed as a list of string.',
'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

where model is bert-base-nli-mean-tokens.

Any help would be appreciated! Thanks a lot!
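For what it's worth, the error originates in PyTorch's embedding lookup, which in the PyTorch versions of that era required int64 (Long) indices; int32 (Int) indices trigger exactly this message. A minimal illustration (the vocabulary and dimension sizes are arbitrary):

import torch

emb = torch.nn.Embedding(num_embeddings=30522, embedding_dim=768)

ids_long = torch.tensor([[101, 2023, 102]], dtype=torch.long)
ids_int = ids_long.to(torch.int32)

emb(ids_long)    # fine: Long (int64) indices
# emb(ids_int)   # on older PyTorch versions this raises the RuntimeError quoted above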

fp16 Opt level typo?

Hi -

RuntimeError: Unexpected optimization level 01. Options are 'O0', 'O1', 'O2', 'O3'.  Note that in `O0`, `O1`, etc., the prefix O is the letter O, not the number zero.

see NVIDIA/apex#339 😓

in SentenceTransformer.fit (default kwargs).
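For reference, apex itself expects the opt level spelled with the letter "O". A minimal sketch against apex's amp API; the Linear model and SGD optimizer are hypothetical stand-ins, not what SentenceTransformer.fit builds internally (requires a CUDA device and NVIDIA apex installed):

import torch
from apex import amp

model = torch.nn.Linear(768, 2).cuda()                    # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # placeholder optimizer

# "O1" (letter O) is valid; "01" (digit zero) raises the RuntimeError quoted above.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")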

It seems the official BERT file cannot be loaded.

The model file at https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.1/ is OK.

I downloaded the official BERT file from https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/modeling_bert.py#L36 and got:

Traceback (most recent call last):
  File "C:/Users/gt/Desktop/sentence-transformers-master/examples/training_nli.py", line 42, in <module>
    model = SentenceTransformer(model_name_or_path='bert-base-uncased/',sentence_transformer_config=sentence_transformer_config)
  File "C:\Users\gt\Desktop\sentence-transformers-master\sentence_transformers\SentenceTransformer.py", line 90, in __init__
    self.transformer_model.load_state_dict(torch.load(output_model_file, map_location='cuda' if torch.cuda.is_available() else 'cpu'))
  File "C:\Python36\lib\site-packages\torch\nn\modules\module.py", line 777, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BERT:
	Missing key(s) in state_dict: 

Feature Request: DistilBERT

Hi,

Encoding large corpora takes a really long time with the current pretrained models. Could you please add a lighter model such as DistilBERT, especially for STS tasks?

Thank you!

Question: How to evaluate models

Suppose I have a dataset that is representative of my particular domain and I use it to fine-tune a pretrained model. What would be a sensible procedure for evaluating the two models? Said differently, how would I know whether the fine-tuning was worth it?
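One practical setup (a hedged sketch, not an official recipe): hold out a labeled dev set from your domain, build an EmbeddingSimilarityEvaluator for it as in the other issues on this page, and run it against both the original and the fine-tuned model. The reader class, file names, and fine-tuned model path below are assumptions borrowed from the repo's example scripts; substitute your own data.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import STSDataReader   # assumed reader; replace with your own domain reader

sts_reader = STSDataReader('datasets/stsbenchmark')        # stand-in for a domain-specific dev set

for name in ['bert-base-nli-mean-tokens', './my-finetuned-model/']:   # hypothetical fine-tuned model path
    model = SentenceTransformer(name)
    dev_data = SentencesDataset(examples=sts_reader.get_examples('sts-dev.csv'), model=model)
    dev_dataloader = DataLoader(dev_data, shuffle=False, batch_size=16)
    evaluator = EmbeddingSimilarityEvaluator(dev_dataloader)
    print(name, model.evaluate(evaluator))                 # higher correlation on your dev set -> fine-tuning helped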

Getting bad results

I was trying this solution as a question-to-question matcher, but on my data the results are very bad. Do you think fine-tuning would help?

Performance of the pretrained model

I ran the following command:

python examples/evaluation_stsbenchmark.py

And I got the following results:

2019-11-06 09:47:12 - Cosine-Similarity : Pearson: 0.7415 Spearman: 0.7698
2019-11-06 09:47:12 - Manhattan-Distance: Pearson: 0.7730 Spearman: 0.7712
2019-11-06 09:47:12 - Euclidean-Distance: Pearson: 0.7713 Spearman: 0.7707
2019-11-06 09:47:12 - Dot-Product-Similarity: Pearson: 0.7273 Spearman: 0.7270

I'm confused: you reported the best performance as 77.12 for cosine similarity (Spearman), but according to the results above it's 76.98. Please correct me if I'm wrong.

README, Model Training from Scratch

You mention a models variable, but it is nowhere instantiated. What is this variable?

# Use BERT for mapping tokens to embeddings
word_embedding_model = models.BERT('bert-base-uncased')
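For context, models here refers to the sub-module shipped with the package (the tracebacks elsewhere on this page reference sentence_transformers/models/BERT.py). A minimal sketch of how it is typically used; treat the exact argument names as assumptions that may vary by version:

from sentence_transformers import SentenceTransformer, models

# Use BERT for mapping tokens to embeddings, then mean pooling to get one vector per sentence.
word_embedding_model = models.BERT('bert-base-uncased')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])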

RoBERTa tokenizer: too many SEP tokens?

In https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/models/RoBERTa.py, get_sentence_features() adds two SEP tokens to the input.

According to the HuggingFace RobertaTokenizer (https://huggingface.co/pytorch-transformers/_modules/pytorch_transformers/tokenization_roberta.html#RobertaTokenizer), only one SEP token is added if you encode a single sentence with add_special_tokens_single_sentence().
However, add_special_tokens_sentences_pair() puts two SEP tokens between the sentences.

Which format is correct? Does it even matter what format you use?
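For reference, the two formats can be inspected directly with the Hugging Face tokenizer. The snippet below uses the newer transformers package name rather than pytorch-transformers; the helper shown is the counterpart of the add_special_tokens_* methods mentioned above:

from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained('roberta-base')
ids_a = tok.encode("First sentence.", add_special_tokens=False)
ids_b = tok.encode("Second sentence.", add_special_tokens=False)

single = tok.build_inputs_with_special_tokens(ids_a)        # <s> A </s>
pair = tok.build_inputs_with_special_tokens(ids_a, ids_b)   # <s> A </s></s> B </s> -- two separators between the sentences
print(tok.convert_ids_to_tokens(pair))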

error in LabelAccuracyEvaluator.py

The code at line 53 in LabelAccuracyEvaluator.py:
_, prediction = model(features[0])
does not work; when I run it, an error occurs.

Some checks on training with numpy

Hi, I'm training an STS model using this code, but on my own domain data:

I'm getting these warnings:

/numpy/lib/function_base.py:2534: RuntimeWarning: invalid value encountered in true_divide
c /= stddev[:, None]
/scipy/stats/_distn_infrastructure.py:901: RuntimeWarning: invalid value encountered in greater
return (a < x) & (x < b)
/scipy/stats/_distn_infrastructure.py:901: RuntimeWarning: invalid value encountered in less
return (a < x) & (x < b)
/scipy/stats/_distn_infrastructure.py:1892: RuntimeWarning: invalid value encountered in less_equal
cond2 = cond0 & (x <= _a)
/numpy/lib/function_base.py:2535: RuntimeWarning: invalid value encountered in true_divide

Then the similarity scores computed at every epoch are all NaN:

Cosine-Similarity : Pearson: nan Spearman: nan
2019-08-30 14:35:20 - Manhattan-Distance: Pearson: nan Spearman: nan
2019-08-30 14:35:20 - Euclidean-Distance: Pearson: nan Spearman: nan
2019-08-30 14:35:20 - Dot-Product-Similarity: Pearson: nan Spearman: nan

The STSbenchmark example works very well! I'm only changing the train/dev/test files, but I couldn't train on this data. Could this be related to the vocabulary of the word embeddings, i.e. that it might not contain some words from my corpus?

Best regards
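Not an answer, but a quick way to see where those warnings usually come from: if the predicted scores (or the gold labels) are constant or contain NaNs, the standard deviation in the Pearson computation is zero and everything comes out NaN. A minimal check with made-up numbers:

import numpy as np
from scipy.stats import pearsonr

predicted = np.array([0.8, 0.8, 0.8, 0.8])   # e.g. near-identical embeddings produce identical cosine scores
gold = np.array([0.9, 0.1, 0.5, 0.7])

print(np.isnan(predicted).any(), predicted.std())   # NaNs present? zero variance?
print(pearsonr(predicted, gold))                    # correlation is undefined (NaN) for a constant input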

Custom Vocabulary

Hello,
My problem is based on product title similarity. The general vocabulary doesn't fit my case.
How can I change the vocabulary?

Thanks

Downloading Pre-Trained Model Failed

The following instruction works well in Google Colab: model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')

However, when running locally in Jupyter Lab on Windows, I get:
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\a324448/.cache\\torch\\sentence_transformers\\public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_bert-base-nli-mean-tokens.zip\\modules.json'

The folder public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_bert-base-nli-mean-tokens.zip\modules.zip is empty on my Windows machine.

Can you provide a link where I could download the model?

Note: Correcting C:\\Users\\a324448/.cache\\torch... to C:\\Users\\a324448\\\.cache\\torch... did not help.

BTW: Great work with the Repo.
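One possible workaround (the URL below is inferred from the cache filename in the error message, so treat it as an assumption): download and unzip the model manually, then point SentenceTransformer at the local folder, as in other issues above.

import urllib.request
import zipfile
from sentence_transformers import SentenceTransformer

# URL pattern guessed from the v0.2 cache path; adjust the model name as needed.
url = "https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/bert-base-nli-stsb-mean-tokens.zip"
urllib.request.urlretrieve(url, "model.zip")
with zipfile.ZipFile("model.zip") as archive:
    archive.extractall("bert-base-nli-stsb-mean-tokens")

model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")   # load from the local folder instead of the cache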

Fine Tuning for Non-English: Dataset for Clustering Task

Hi,
I have a few questions related to the models:

  1. Is bert-base-nli-mean-tokens trained on an English-only dataset? I have used this model to get embeddings for Urdu-language sentences. It does produce embeddings, but they are of low quality.

  2. I want to train a sentence transformer for Urdu. The intended task is clustering. Which type of dataset would you suggest for fine-tuning if I start from the multilingual BERT model?

Custom Embeddings and Tokenizer

Hello! Sometimes you just start digging around GitHub and find a library that's just awesome; I think this is one of those cases. Before anything else, I'd like to thank you for the great work!

On the technical side, would it be possible to use custom word embeddings in the WordEmbeddings class together with a custom tokenizer? I'm thinking of using BPEmb embeddings as the base word embeddings, as they perform very well for similarity tasks.

SentenceTransformer partially ignores 'device' attribute

Hi, thanks for the awesome research!
I'm currently experimenting with large-scale similarity search on our institute's shared machine, which has two 2080 Ti GPUs. Sometimes GPU 0 is used by someone else, so I've loaded SentenceTransformer with the device argument "cuda:1". I've noticed (via gpustat) that while the model is correctly loaded into GPU 1's memory (consuming around 980+ MB), calling model.encode() still allocates a small amount of memory on GPU 0.
[gpustat screenshots omitted]
It is not a big problem for now, but it looks like incorrect behavior.
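A common workaround in the meantime, using plain CUDA/PyTorch mechanics rather than anything specific to this library: hide GPU 0 from the process entirely, so any stray context creation can only land on the intended card.

import os

# Must be set before torch initializes CUDA (i.e. before importing sentence_transformers/torch).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

from sentence_transformers import SentenceTransformer

# With GPU 0 hidden, "cuda:0" now maps to the physical GPU 1, and nothing can spill onto GPU 0.
model = SentenceTransformer('bert-base-nli-stsb-mean-tokens', device='cuda:0')
embeddings = model.encode(['a test sentence'])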

Using the NLI dataset as a development dataset

Hi, I am fine-tuning the training_nli_bert.py script with the bert-multilingual model to generate sentence embeddings for Urdu. I only have an NLI dataset available for training and evaluation.
My question: can we use the NLI dev dataset in place of the STS dataset for evaluation?
Are there any downsides of using the NLI dataset for the quality of the sentence embeddings?
What changes do I need to make in the following code?

logging.info("Read STSbenchmark dev dataset")
dev_data = SentencesDataset(examples=sts_reader.get_examples('sts-dev.csv'), model=model)
dev_dataloader = DataLoader(dev_data, shuffle=False, batch_size=batch_size)
evaluator = EmbeddingSimilarityEvaluator(dev_dataloader)

@nreimers please help

Is it possible to encode using multiple GPUs?

Thanks for this beautiful package; it saves a lot of work for semantic embedding. I am working with a large database, trying to transform documents into an embedding matrix. When I ran the code, it seemed to use only a single GPU to encode the sentences. Is there any way I could do this with multiple GPUs?
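For what it's worth, later releases of sentence-transformers ship a multi-process encoding helper; if your installed version predates it, the fallback is to split the corpus manually across one model instance per GPU. A sketch using the newer API:

from sentence_transformers import SentenceTransformer

if __name__ == '__main__':   # required because the pool spawns worker processes
    model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
    sentences = ["sentence %d" % i for i in range(100000)]

    # One worker per visible GPU by default; the input is sharded across them.
    pool = model.start_multi_process_pool()
    embeddings = model.encode_multi_process(sentences, pool)
    model.stop_multi_process_pool(pool)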

Unexpected high similarity

I am using the bert-base-nli-stsb-mean-tokens model in an unsupervised fashion to get the similarity between sentences.
It performs really well in some cases.
But after more extensive analysis, I found cases where such a high similarity score makes no sense.

I am trying to figure out why the similarity is so high when the sentences are extremely short or make no sense at all.
What is really happening here?
Any leads would be helpful.

Thanks in advance,
For your reference: [screenshot attached, 2019-08-22 6:37 PM]

sentence bert model in onnx format

I would like to convert a Sentence-BERT model from PyTorch to TensorFlow using ONNX, and I tried to follow the standard ONNX procedure for converting a PyTorch model. But I'm having difficulty determining the ONNX input arguments for the model: I get TypeError: forward() takes 2 positional arguments but 4 were given. Suggestions appreciated!
model = SentenceTransformer('output/continue_training_model')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
dummy_input0 = torch.LongTensor(batch_size, max_seq_length).to(device)
dummy_input1 = torch.LongTensor(batch_size, max_seq_length).to(device)
dummy_input2 = torch.LongTensor(batch_size, max_seq_length).to(device)
torch.onnx.export(model, (dummy_input0, dummy_input1, dummy_input2), onnx_file_name, verbose=True)
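A possible explanation and workaround (inferred from the traceback in the IntTensor issue above, where the model's forward consumes a single features dict with keys 'input_ids', 'token_type_ids' and 'input_mask'): torch.onnx.export passes the dummy tensors positionally, while SentenceTransformer.forward expects one dict argument. A hedged sketch of a wrapper that adapts the signature; the key names, sequence length, and output selection are assumptions that may differ by version:

import torch
from sentence_transformers import SentenceTransformer

class OnnxWrapper(torch.nn.Module):
    """Adapts SentenceTransformer's dict-based forward to positional tensors for torch.onnx.export."""
    def __init__(self, st_model):
        super().__init__()
        self.st_model = st_model

    def forward(self, input_ids, token_type_ids, input_mask):
        # Key names taken from the traceback above; adjust if your version differs.
        features = {'input_ids': input_ids,
                    'token_type_ids': token_type_ids,
                    'input_mask': input_mask}
        return self.st_model(features)['sentence_embedding']

model = SentenceTransformer('output/continue_training_model')
wrapper = OnnxWrapper(model).eval()

batch_size, max_seq_length = 1, 128
dummy_ids = torch.zeros(batch_size, max_seq_length, dtype=torch.long)    # zeros keep the indices in range
dummy_types = torch.zeros(batch_size, max_seq_length, dtype=torch.long)
dummy_mask = torch.ones(batch_size, max_seq_length, dtype=torch.long)

torch.onnx.export(wrapper, (dummy_ids, dummy_types, dummy_mask), "sbert.onnx", verbose=True)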

Why is continued training on NLI needed for the STS task?

I fail to understand the reason behind the continued training. Why couldn't the BERT model be fine-tuned end-to-end for semantic similarity using only the STS benchmark dataset?

Is it just because the STS data is small, or does continued training on NLI provide some other leverage?

parallel embedding can't be done

Hi @fhaase2, I have created a service using your sentence embedding model. However, when parallel requests hit the server, the encoding call (embedder.encode) fails.
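One library-agnostic workaround, assuming the service runs the model in a single process with multiple request threads: serialize access to encode() with a lock so concurrent requests don't interleave on the same model instance.

import threading
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('bert-base-nli-mean-tokens')
encode_lock = threading.Lock()

def embed(sentences):
    # Only one request thread encodes at a time; the others queue up instead of failing.
    with encode_lock:
        return embedder.encode(sentences)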

Any pointers for training languages other than English?

For training a language other than English, can I train on top of the English pretrained model? Roughly how much corpus do I need to get good fidelity? How do I train for a product-listing-like scenario (max 50 words)? How important is it to have a dataset with representative vocabulary (i.e., if we are doing semantic search, can it cope with missing vocabulary)?

Multiple GPU support

First, I just wanted to say that this repo is fantastic. It's very useful and powerful for training a model to detect sentence similarity.

A question I have: are there any plans to support multi-GPU training? If not, could you recommend any examples in other repos that show how to implement this? Thank you,

Kevin

reproducing the paper's best results

I've tried to replicate the paper. For bert-base-nli-mean-tokens, the model trained from scratch with your code reached 74.71 for cosine similarity on the STS test set, which is far too low compared to the score in the paper. Any thoughts?
