
sentence-transformers's People

Contributors

andrewkittredge, aphedges, arzelaascoii, bmaz, cpcdoy, fkdosilovic, fpgmaas, fros1y, guotong1988, josemarcosrf, kddubey, kwang2049, mauricesvp, michaelfeil, milistu, nikitajz, nimaboscarino, nreimers, omarespejel, osanseviero, philipmay, quetzalcohuatl, rafaelwo, sadakmed, sidhantls, sugatoray, swarajban, tomaarsen, zhenghongming888, zoltan-fedor


sentence-transformers's Issues

Range of CosineSimilarity [-1, 1] does not match the label range (0, 1)

I see in the code that for textual semantic similarity tasks like STS, CosineSimilarityLoss first computes the cosine similarity between the two vectors and then uses mean squared error to compute the loss.
However, the range of cosine similarity is [-1, 1] (<0: not similar, >0: similar) while the labels are in (0, 1), so it seems that during training the model tries to map the space [-1, 1] onto the space [0, 1]. At prediction time, >0.5 means similar and <0.5 not similar.
I'm not sure whether my understanding is correct.
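For readers hitting the same question, here is a minimal sketch of the setup being described (cosine similarity in [-1, 1] regressed with MSE against labels in [0, 1]); the random tensors are stand-ins, not the library's actual implementation:

import torch

emb_a = torch.randn(8, 768)       # embeddings of the first sentences
emb_b = torch.randn(8, 768)       # embeddings of the second sentences
labels = torch.rand(8)            # e.g. STS gold scores normalized to [0, 1]

cos_sim = torch.cosine_similarity(emb_a, emb_b)        # values in [-1, 1]
loss = torch.nn.functional.mse_loss(cos_sim, labels)   # MSE pushes the predictions toward the [0, 1] labels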

Using sentence transformers for transforming words with word-windows?

I've written an Extractive Summarizer called CX_DB8 which utilizes pretrained word-embedding models to summarize/semantically-search documents. It works at the word, sentence or paragraph level, and supports any pretrained model available with pytorch-transformers or offered via the Flair AI package.

My question is this: is sentence-transformers suitable for training / fine-tuning with, say, 10-word sliding word-windows? What about paragraph-sized texts? Are the pretrained models offered here suitable for running word-windows through without any fine-tuning? What do you think about combining these sentence/word-window embeddings with the PageRank algorithm?
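In case it helps frame the question, encoding sliding word-windows with a pretrained model works mechanically just like encoding sentences; a small sketch (the window size and stride are arbitrary choices here, not recommendations):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')

words = "the quick brown fox jumps over the lazy dog again and again".split()
window_size, stride = 10, 2
windows = [" ".join(words[i:i + window_size])
           for i in range(0, max(1, len(words) - window_size + 1), stride)]

window_embeddings = model.encode(windows)   # one embedding per word-window, no fine-tuning needed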

Question about softmaxloss

Hi,
This is a wonderful project, but I have a question about the softmax loss. If you train with softmax loss, then at evaluation time you still need the classifier layer. That means that finding the pair with the highest similarity in a collection of n = 10,000 sentences still requires n·(n−1)/2 = 49,995,000 inference computations, right? Because you still need the weights of the classifier layer.
I believe that in order to use cosine similarity to evaluate the model, you have to train with the cosine similarity loss. Is it possible to evaluate the model with cosine similarity after training with the softmax loss?
Thanks
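To make the counting in the question concrete: once sentences are mapped to fixed-size embeddings, only n encoder passes are needed, and the n·(n−1)/2 similarities reduce to cheap vector operations with no classifier head involved. A hedged sketch:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')   # a model trained with softmax loss on NLI

sentences = ["A man is eating food.",
             "A man is eating a piece of bread.",
             "The girl is carrying a baby."]
emb = np.asarray(model.encode(sentences))                  # n encoder passes
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)     # L2-normalize
cosine_scores = emb @ emb.T                                # all pairwise cosine similarities at once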

Training 2 distinct models to map different Inputs to Shared Vector Space

Hi -
As I understand it, you choose to use the same model to encode both sentences in, let's say, the NLI task. From the cosine loss implementation:

def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor):
    reps = [self.model(sentence_feature)['sentence_embedding'] for sentence_feature in sentence_features]

In this case, since the first sentence represents a premise and the latter a hypothesis, would it make sense to train two separate models simultaneously that map onto the same vector space?

Is this already what's happening and I'm just confused? Or is this something you've already explored?

For example something like:

def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor):
    rep_a = self.model_a(sentence_features[0])['sentence_embedding']
    rep_b = self.model_b(sentence_features[1])['sentence_embedding']
    output = torch.cosine_similarity(rep_a, rep_b)

Thank you
Jack

Question: Why train from scratch?

First, thank you very much for your work on sentence-transformers. I can't wait to start using it and really digging into it to understand everything.

I have a very naive question. Why would someone want to train a model from scratch? I can see fine-tuning a pre-trained model on a dataset that is, say, more representative of the vocabulary for your task. But what would be the various reasons for wanting to train, say, bert-base-uncased from scratch?

RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.IntTensor instead

Everything was fine before I reinstalled Anaconda to the latest version (2019.07) (or it could be the update to 0.2.2?), but now when I run the code it gives this RuntimeError:

File "D:\anaconda\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 136, in encode
embeddings = self.forward(features)
File "D:\anaconda\lib\site-packages\torch\nn\modules\container.py", line 92, in forward
input = module(input)
File "D:\anaconda\lib\site-packages\torch\nn\modules\module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "D:\anaconda\lib\site-packages\sentence_transformers\models\BERT.py", line 27, in forward
output_tokens = self.bert(input_ids=features['input_ids'], token_type_ids=features['token_type_ids'], attention_mask=features['input_mask'])[0]
File "D:\anaconda\lib\site-packages\torch\nn\modules\module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "D:\anaconda\lib\site-packages\pytorch_transformers\modeling_bert.py", line 712, in forward
embedding_output = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
File "D:\anaconda\lib\site-packages\torch\nn\modules\module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "D:\anaconda\lib\site-packages\pytorch_transformers\modeling_bert.py", line 264, in forward
words_embeddings = self.word_embeddings(input_ids)
File "D:\anaconda\lib\site-packages\torch\nn\modules\module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "D:\anaconda\lib\site-packages\torch\nn\modules\sparse.py", line 118, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "D:\anaconda\lib\site-packages\torch\nn\functional.py", line 1454, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.IntTensor instead (while checking arguments for embedding)

I've reinstalled pytorch and pytorch-transformers, but it doesn't help. Here is my simple code:

model = SentenceTransformer("./model/")
sentences = ['This framework generates embeddings for each input sentence',
'Sentences are passed as a list of string.',
'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

where model is bert-base-nli-mean-tokens.

Any help would be appreciated! Thanks a lot!
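For what it's worth, the error originates in PyTorch's embedding lookup, which in the PyTorch versions of that era required int64 (Long) indices; int32 (Int) indices trigger exactly this message. A minimal illustration (the vocabulary and dimension sizes are arbitrary):

import torch

emb = torch.nn.Embedding(num_embeddings=30522, embedding_dim=768)

ids_long = torch.tensor([[101, 2023, 102]], dtype=torch.long)
ids_int = ids_long.to(torch.int32)

emb(ids_long)    # fine: Long (int64) indices
# emb(ids_int)   # on older PyTorch versions this raises the RuntimeError quoted above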

fp16 Opt level typo?

Hi -

RuntimeError: Unexpected optimization level 01. Options are 'O0', 'O1', 'O2', 'O3'.  Note that in `O0`, `O1`, etc., the prefix O is the letter O, not the number zero.

see NVIDIA/apex#339 😓

in SentenceTransformer.fit (default kwargs).
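For reference, apex itself expects the opt level spelled with the letter "O". A minimal sketch against apex's amp API; the Linear model and SGD optimizer are hypothetical stand-ins, not what SentenceTransformer.fit builds internally (requires a CUDA device and NVIDIA apex installed):

import torch
from apex import amp

model = torch.nn.Linear(768, 2).cuda()                    # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # placeholder optimizer

# "O1" (letter O) is valid; "01" (digit zero) raises the RuntimeError quoted above.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")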

It seems the official BERT file cannot be loaded.

The model file at https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.1/ is OK.

I downloaded the official BERT file from https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/modeling_bert.py#L36 and got:

Traceback (most recent call last):
  File "C:/Users/gt/Desktop/sentence-transformers-master/examples/training_nli.py", line 42, in <module>
    model = SentenceTransformer(model_name_or_path='bert-base-uncased/',sentence_transformer_config=sentence_transformer_config)
  File "C:\Users\gt\Desktop\sentence-transformers-master\sentence_transformers\SentenceTransformer.py", line 90, in __init__
    self.transformer_model.load_state_dict(torch.load(output_model_file, map_location='cuda' if torch.cuda.is_available() else 'cpu'))
  File "C:\Python36\lib\site-packages\torch\nn\modules\module.py", line 777, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BERT:
	Missing key(s) in state_dict: 

Feature Request: DistilBERT

Hi,

Encoding large corpora takes a really long time with the current pretrained models. Could you please add a lighter model such as DistilBERT, especially for STS tasks?

Thank you!

Question: How to evaluate models

Suppose I have a dataset that is representative of my particular domain and I use it to fine-tune a pretrained model. What would be a sensible procedure for evaluating the two models? Said differently, how would I know whether the fine-tuning was worth it?
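One practical setup (a hedged sketch, not an official recipe): hold out a labeled dev set from your domain, build an EmbeddingSimilarityEvaluator for it as in the other issues on this page, and run it against both the original and the fine-tuned model. The reader class, file names, and fine-tuned model path below are assumptions borrowed from the repo's example scripts; substitute your own data.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import STSDataReader   # assumed reader; replace with your own domain reader

sts_reader = STSDataReader('datasets/stsbenchmark')        # stand-in for a domain-specific dev set

for name in ['bert-base-nli-mean-tokens', './my-finetuned-model/']:   # hypothetical fine-tuned model path
    model = SentenceTransformer(name)
    dev_data = SentencesDataset(examples=sts_reader.get_examples('sts-dev.csv'), model=model)
    dev_dataloader = DataLoader(dev_data, shuffle=False, batch_size=16)
    evaluator = EmbeddingSimilarityEvaluator(dev_dataloader)
    print(name, model.evaluate(evaluator))                 # higher correlation on your dev set -> fine-tuning helped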

Getting bad results

I was trying this solution as a question-to-question matcher, but on my data the results are very bad. Do you think fine-tuning would help?

Performance of the pretrained model

I ran the following command:

python examples/evaluation_stsbenchmark.py

And I got the following results:

2019-11-06 09:47:12 - Cosine-Similarity : Pearson: 0.7415 Spearman: 0.7698
2019-11-06 09:47:12 - Manhattan-Distance: Pearson: 0.7730 Spearman: 0.7712
2019-11-06 09:47:12 - Euclidean-Distance: Pearson: 0.7713 Spearman: 0.7707
2019-11-06 09:47:12 - Dot-Product-Similarity: Pearson: 0.7273 Spearman: 0.7270

I'm confused: you reported the best performance as 77.12 for cosine similarity (Spearman), but according to the results above it's 76.98. Please correct me if I'm wrong.

README, Model Training from Scratch

You mention a models variable, but it is nowhere instantiated. What is this variable?

# Use BERT for mapping tokens to embeddings
word_embedding_model = models.BERT('bert-base-uncased')
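For context, models here refers to the sub-module shipped with the package (the tracebacks elsewhere on this page reference sentence_transformers/models/BERT.py). A minimal sketch of how it is typically used; treat the exact argument names as assumptions that may vary by version:

from sentence_transformers import SentenceTransformer, models

# Use BERT for mapping tokens to embeddings, then mean pooling to get one vector per sentence.
word_embedding_model = models.BERT('bert-base-uncased')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])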

RoBERTa tokenizer: too many SEP tokens?

In https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/models/RoBERTa.py, get_sentence_features() adds two SEP tokens to the input.

According to the HuggingFace RobertaTokenizer (https://huggingface.co/pytorch-transformers/_modules/pytorch_transformers/tokenization_roberta.html#RobertaTokenizer), only one SEP token is added if you encode a single sentence with add_special_tokens_single_sentence().
However, add_special_tokens_sentences_pair() puts two SEP tokens between the sentences.

Which format is correct? Does it even matter what format you use?
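For reference, the two formats can be inspected directly with the Hugging Face tokenizer. The snippet below uses the newer transformers package name rather than pytorch-transformers; the helper shown is the counterpart of the add_special_tokens_* methods mentioned above:

from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained('roberta-base')
ids_a = tok.encode("First sentence.", add_special_tokens=False)
ids_b = tok.encode("Second sentence.", add_special_tokens=False)

single = tok.build_inputs_with_special_tokens(ids_a)        # <s> A </s>
pair = tok.build_inputs_with_special_tokens(ids_a, ids_b)   # <s> A </s></s> B </s> -- two separators between the sentences
print(tok.convert_ids_to_tokens(pair))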

error in LabelAccuracyEvaluator.py

The code at line 53 in LabelAccuracyEvaluator.py:
_, prediction = model(features[0])
does not work; when I run it, an error occurs.

Some checks on training with numpy

Hi, I'm training an STS model using this code, but on my own domain data:

I'm getting these warnings:

/numpy/lib/function_base.py:2534: RuntimeWarning: invalid value encountered in true_divide
c /= stddev[:, None]
/scipy/stats/_distn_infrastructure.py:901: RuntimeWarning: invalid value encountered in greater
return (a < x) & (x < b)
/scipy/stats/_distn_infrastructure.py:901: RuntimeWarning: invalid value encountered in less
return (a < x) & (x < b)
/scipy/stats/_distn_infrastructure.py:1892: RuntimeWarning: invalid value encountered in less_equal
cond2 = cond0 & (x <= _a)
/numpy/lib/function_base.py:2535: RuntimeWarning: invalid value encountered in true_divide

Then the similarity scores computed at every epoch are all NaN:

Cosine-Similarity : Pearson: nan Spearman: nan
2019-08-30 14:35:20 - Manhattan-Distance: Pearson: nan Spearman: nan
2019-08-30 14:35:20 - Euclidean-Distance: Pearson: nan Spearman: nan
2019-08-30 14:35:20 - Dot-Product-Similarity: Pearson: nan Spearman: nan

The STSbenchmark example works very well! I'm only changing the train/dev/test files, but I couldn't train on this data. Could this be related to the vocabulary of the word embeddings, i.e. that it might not contain some words from my corpus?

Best regards
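Not an answer, but a quick way to see where those warnings usually come from: if the predicted scores (or the gold labels) are constant or contain NaNs, the standard deviation in the Pearson computation is zero and everything comes out NaN. A minimal check with made-up numbers:

import numpy as np
from scipy.stats import pearsonr

predicted = np.array([0.8, 0.8, 0.8, 0.8])   # e.g. near-identical embeddings produce identical cosine scores
gold = np.array([0.9, 0.1, 0.5, 0.7])

print(np.isnan(predicted).any(), predicted.std())   # NaNs present? zero variance?
print(pearsonr(predicted, gold))                    # correlation is undefined (NaN) for a constant input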

Custom Vocabulary

Hello,
My problem is based on product title similarity. The general vocabulary doesn't fit my case.
How can I change the vocabulary?

Thanks

Downloading Pre-Trained Model Failed

The following instruction works well in Google Colab: model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')

However, when running locally in Jupyter Lab on Windows, I get:
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\a324448/.cache\\torch\\sentence_transformers\\public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_bert-base-nli-mean-tokens.zip\\modules.json'

The folder public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_bert-base-nli-mean-tokens.zip\modules.zip is empty on my Windows machine.

Can you provide a link where I could download the model?

Note: Correcting C:\\Users\\a324448/.cache\\torch... to C:\\Users\\a324448\\\.cache\\torch... did not help.

BTW: Great work with the Repo.
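One possible workaround (the URL below is inferred from the cache filename in the error message, so treat it as an assumption): download and unzip the model manually, then point SentenceTransformer at the local folder, as in other issues above.

import urllib.request
import zipfile
from sentence_transformers import SentenceTransformer

# URL pattern guessed from the v0.2 cache path; adjust the model name as needed.
url = "https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/bert-base-nli-stsb-mean-tokens.zip"
urllib.request.urlretrieve(url, "model.zip")
with zipfile.ZipFile("model.zip") as archive:
    archive.extractall("bert-base-nli-stsb-mean-tokens")

model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")   # load from the local folder instead of the cache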

Fine Tuning for Non-English: Dataset for Clustering Task

Hi,
I have a few questions related to the models:

  1. Is bert-base-nli-mean-tokens trained on an English-only dataset? I have used this model to get embeddings for Urdu-language sentences. It does produce embeddings, but they are of low quality.

  2. I want to train a sentence transformer for Urdu. The intended task is clustering. Which type of dataset would you suggest for fine-tuning if I start from the multilingual BERT model?

Custom Embeddings and Tokenizer

Hello! Sometimes you just start digging around GitHub and find a library that's just awesome; I think this is one of those cases. Before anything else, I'd like to thank you for the great work!

On the technical side, would it be possible to use custom word embeddings in the WordEmbeddings class together with a custom tokenizer? I'm thinking of using BPEmb embeddings as the base word embeddings, as they perform very well for similarity tasks.

SentenceTransformer partially ignores 'device' attribute

Hi, thanks for the awesome research!
I'm currently experimenting with large-scale similarity search on our institute's shared machine, which has two 2080 Ti GPUs. Sometimes GPU 0 is used by someone else, so I've loaded SentenceTransformer with the device argument "cuda:1". I've noticed (via gpustat) that while the model is correctly loaded into GPU 1's memory (consuming around 980+ MB), calling model.encode() still allocates a small amount of memory on GPU 0.
[gpustat screenshots omitted]
It is not a big problem for now, but it looks like incorrect behavior.
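A common workaround in the meantime, using plain CUDA/PyTorch mechanics rather than anything specific to this library: hide GPU 0 from the process entirely, so any stray context creation can only land on the intended card.

import os

# Must be set before torch initializes CUDA (i.e. before importing sentence_transformers/torch).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

from sentence_transformers import SentenceTransformer

# With GPU 0 hidden, "cuda:0" now maps to the physical GPU 1, and nothing can spill onto GPU 0.
model = SentenceTransformer('bert-base-nli-stsb-mean-tokens', device='cuda:0')
embeddings = model.encode(['a test sentence'])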

Using the NLI dataset as a development dataset

Hi, I am fine-tuning the training_nli_bert.py script with the bert-multilingual model to generate sentence embeddings for Urdu. I only have an NLI dataset available for training and evaluation.
My question: can we use the NLI dev dataset in place of the STS dataset for evaluation?
Are there any downsides of using the NLI dataset for the quality of the sentence embeddings?
What changes do I need to make in the following code?

logging.info("Read STSbenchmark dev dataset")
dev_data = SentencesDataset(examples=sts_reader.get_examples('sts-dev.csv'), model=model)
dev_dataloader = DataLoader(dev_data, shuffle=False, batch_size=batch_size)
evaluator = EmbeddingSimilarityEvaluator(dev_dataloader)

@nreimers please help

Is it possible to encode using multiple GPUs?

Thanks for this beautiful package; it saves a lot of work for semantic embedding. I am working with a large database, trying to transform documents into an embedding matrix. When I ran the code, it seemed to use only a single GPU to encode the sentences. Is there any way I could do this with multiple GPUs?
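For what it's worth, later releases of sentence-transformers ship a multi-process encoding helper; if your installed version predates it, the fallback is to split the corpus manually across one model instance per GPU. A sketch using the newer API:

from sentence_transformers import SentenceTransformer

if __name__ == '__main__':   # required because the pool spawns worker processes
    model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
    sentences = ["sentence %d" % i for i in range(100000)]

    # One worker per visible GPU by default; the input is sharded across them.
    pool = model.start_multi_process_pool()
    embeddings = model.encode_multi_process(sentences, pool)
    model.stop_multi_process_pool(pool)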

Unexpected high similarity

I am using the bert-base-nli-stsb-mean-tokens model in an unsupervised fashion to get the similarity between sentences.
It performs really well in some cases.
But after more extensive analysis, I found cases where such a high similarity score makes no sense.

I am trying to figure out why the similarity is so high when the sentences are extremely short or make no sense at all.
What is really happening here?
Any leads would be helpful.

Thanks in advance,
For your reference: [screenshot attached, 2019-08-22 6:37 PM]

sentence bert model in onnx format

I would like to convert a Sentence-BERT model from PyTorch to TensorFlow using ONNX, and I tried to follow the standard ONNX procedure for converting a PyTorch model. But I'm having difficulty determining the ONNX input arguments for the model: I get TypeError: forward() takes 2 positional arguments but 4 were given. Suggestions appreciated!
model = SentenceTransformer('output/continue_training_model')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
dummy_input0 = torch.LongTensor(batch_size, max_seq_length).to(device)
dummy_input1 = torch.LongTensor(batch_size, max_seq_length).to(device)
dummy_input2 = torch.LongTensor(batch_size, max_seq_length).to(device)
torch.onnx.export(model, (dummy_input0, dummy_input1, dummy_input2), onnx_file_name, verbose=True)
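A possible explanation and workaround (inferred from the traceback in the IntTensor issue above, where the model's forward consumes a single features dict with keys 'input_ids', 'token_type_ids' and 'input_mask'): torch.onnx.export passes the dummy tensors positionally, while SentenceTransformer.forward expects one dict argument. A hedged sketch of a wrapper that adapts the signature; the key names, sequence length, and output selection are assumptions that may differ by version:

import torch
from sentence_transformers import SentenceTransformer

class OnnxWrapper(torch.nn.Module):
    """Adapts SentenceTransformer's dict-based forward to positional tensors for torch.onnx.export."""
    def __init__(self, st_model):
        super().__init__()
        self.st_model = st_model

    def forward(self, input_ids, token_type_ids, input_mask):
        # Key names taken from the traceback above; adjust if your version differs.
        features = {'input_ids': input_ids,
                    'token_type_ids': token_type_ids,
                    'input_mask': input_mask}
        return self.st_model(features)['sentence_embedding']

model = SentenceTransformer('output/continue_training_model')
wrapper = OnnxWrapper(model).eval()

batch_size, max_seq_length = 1, 128
dummy_ids = torch.zeros(batch_size, max_seq_length, dtype=torch.long)    # zeros keep the indices in range
dummy_types = torch.zeros(batch_size, max_seq_length, dtype=torch.long)
dummy_mask = torch.ones(batch_size, max_seq_length, dtype=torch.long)

torch.onnx.export(wrapper, (dummy_ids, dummy_types, dummy_mask), "sbert.onnx", verbose=True)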

Why is continued training on NLI needed for the STS task?

I fail to understand the reason behind the continued training. Why couldn't the BERT model be fine-tuned end-to-end for semantic similarity using only the STS benchmark dataset?

Is it just because the STS data is small, or does continued training on NLI provide some other leverage?

parallel embedding can't be done

Hi @fhaase2, I have created a service using your sentence embedding model. However, when parallel requests hit the server, the encoding call (embedder.encode) fails.
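One library-agnostic workaround, assuming the service runs the model in a single process with multiple request threads: serialize access to encode() with a lock so concurrent requests don't interleave on the same model instance.

import threading
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('bert-base-nli-mean-tokens')
encode_lock = threading.Lock()

def embed(sentences):
    # Only one request thread encodes at a time; the others queue up instead of failing.
    with encode_lock:
        return embedder.encode(sentences)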

Any pointers for training languages other than English?

For training a language other than English, can I train on top of the English pretrained model? Roughly how much corpus do I need to get good fidelity? How do I train for a product-listing-like scenario (max 50 words)? How important is it to have a dataset with representative vocabulary (i.e., if we are doing semantic search, can it cope with missing vocabulary)?

Multiple GPU support

First, I just wanted to say that this repo is fantastic. It's very useful and powerful for training a model to detect sentence similarity.

A question I have: are there any plans to support multi-GPU training? If not, could you recommend any examples in other repos that show how to implement this? Thank you,

Kevin

reproducing the paper's best results

I've tried to replicate the paper. For bert-base-nli-mean-tokens, the model trained from scratch with your code reached 74.71 for cosine similarity on the STS test set, which is far too low compared to the score in the paper. Any thoughts?
