
semantic-text-similarity's Introduction

semantic-text-similarity

an easy-to-use interface to fine-tuned BERT models for computing semantic similarity. that's it.

This project contains an interface to fine-tuned, BERT-based semantic text similarity models. It modifies pytorch-transformers by abstracting away all the research benchmarking code, making the models easier to apply in real-world settings.

| Model             | Dataset | Dev. Correlation |
|-------------------|---------|------------------|
| Web STS BERT      | STS-B   | 0.893            |
| Clinical STS BERT | MED-STS | 0.854            |

Installation

Install with pip:

pip install semantic-text-similarity

or directly:

pip install git+https://github.com/AndriyMulyar/semantic-text-similarity

Use

The models map batches of sentence pairs to real-valued similarity scores in the range [0, 5].

from semantic_text_similarity.models import WebBertSimilarity
from semantic_text_similarity.models import ClinicalBertSimilarity

web_model = WebBertSimilarity(device='cpu', batch_size=10)  # device defaults to GPU prediction if not specified

clinical_model = ClinicalBertSimilarity(device='cuda', batch_size=10)  # device defaults to GPU prediction if not specified

web_model.predict([("She won an olympic gold medal","The women is an olympic champion")])

More examples.

Notes

  • You will need a GPU if you want any reasonable prediction speed from these models.
  • Model downloads are cached in ~/.cache/torch/semantic_text_similarity/. Try clearing this folder if you have issues.

Acknowledgement

Clinical models in this project were submitted to the 2019 N2C2 Shared Task Track 1. Implementation and model training in this project were supported by funding from the Mark Dredze Lab at Johns Hopkins University.

semantic-text-similarity's People

Contributors

andriymulyar


semantic-text-similarity's Issues

dataset

Hi,

I attended the 2019 n2c2 Task 3 and found your model useful on my data. I am continuing to work on Task 3, so could you tell me how to obtain the MedSTS dataset? Thank you very much!

Wen.

PermissionError: [Errno 13] Permission denied:

Hello,

When I launch your example, the program begins by downloading web_bert_similarity.tar.gz, and at the end of the download I get the following error:
Failed to download model: web-bert-similarity
Traceback (most recent call last):
  File "semantic_similarity.py", line 4, in <module>
    web_model = WebBertSimilarity(device='cpu', batch_size=10) #defaults to GPU prediction
  File "F:\Users\rachidj\Anaconda3\envs\nlpenv\lib\site-packages\semantic_text_similarity\models\bert\web_similarity.py", line 7, in __init__
    model_path = get_model_path(model_name)
  File "F:\Users\rachidj\Anaconda3\envs\nlpenv\lib\site-packages\semantic_text_similarity\models\util.py", line 40, in get_model_path
    raise exc
  File "F:\Users\rachidj\Anaconda3\envs\nlpenv\lib\site-packages\semantic_text_similarity\models\util.py", line 36, in get_model_path
    tar = tarfile.open(temp_file.name)
  File "F:\Users\rachidj\Anaconda3\envs\nlpenv\lib\tarfile.py", line 1573, in open
    return func(name, "r", fileobj, **kwargs)
  File "F:\Users\rachidj\Anaconda3\envs\nlpenv\lib\tarfile.py", line 1638, in gzopen
    fileobj = gzip.GzipFile(name, mode + "b", compresslevel, fileobj)
  File "F:\Users\rachidj\Anaconda3\envs\nlpenv\lib\gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\rachidj\\AppData\\Local\\Temp\\tmpq6i6wt_s'

So I downloaded the files manually, but I don't know where to put them to make the example script work.
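For readers hitting the same permission error, one possible workaround is to extract the manually downloaded archive into the cache directory mentioned in the Notes section above. The exact folder layout the loader expects is an assumption here, so treat this as a sketch rather than a confirmed fix:

import os
import tarfile

# Assumption: the library resolves models from this cache directory
# (see the Notes section of the README); adjust paths for your own system.
cache_dir = os.path.expanduser("~/.cache/torch/semantic_text_similarity/")
os.makedirs(cache_dir, exist_ok=True)

# Extract the manually downloaded archive into the cache directory.
with tarfile.open("web_bert_similarity.tar.gz", "r:gz") as archive:
    archive.extractall(cache_dir)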

Loading of WebBertSimilarity model breaks

Hello,

While using this repo, I noticed that the web-bert-similarity model fails to load. After some digging, I realized that the authors of hugging-face/pytorch-transformers have released a new version, 1.2.0, which breaks your code at line 27 of /models/bert/similarity.py.
I was able to fix it by pinning the older version of pytorch-transformers (1.1.0) and thought I should give you a heads-up in case you weren't aware.

Thanks and good luck!

problem when swapped sentences

I load the model and predict the similarity between sentence A and sentence B. When I swap the order of the two sentences, I get a different similarity value. Why does this happen? Isn't this measure supposed to be symmetric?
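Because the fine-tuned BERT model reads the two sentences as one ordered sequence, its output is not guaranteed to be symmetric. If a symmetric score is needed, one possible workaround (not part of the library's API) is to average the two orderings:

def symmetric_similarity(model, sentence_a, sentence_b):
    """Average the scores of both orderings to get an order-independent value.

    `model` is a WebBertSimilarity or ClinicalBertSimilarity instance; this
    helper is an illustrative workaround, not part of the package.
    """
    forward, backward = model.predict([(sentence_a, sentence_b),
                                       (sentence_b, sentence_a)])
    return (forward + backward) / 2.0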

Is BERT actually good for semantic similarity? Check the example

According to the scores below, the second pair is rated as more semantically similar than the first, but in reality it should be the opposite.

>>> model.predict([("he is an indian", "he has indian citizenship")])
array([3.2054904], dtype=float32)
>>> model.predict([("he is an indian", "he is  not an indian")])
array([3.590286], dtype=float32)

RuntimeError: The expanded size of the tensor (11) must match the existing size (15) at non-singleton dimension 0. Target sizes: [11]. Tensor sizes: [15]

>>> model.predict([("he lives in america","he is an indian but stays in america")])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/gitcodes/semantic-text-similarity/semantic_text_similarity/models/bert/similarity.py", line 111, in predict
    [{'sentence_1': s1, 'sentence_2':s2} for s1, s2 in data], self.bert_tokenizer)
  File "/data/gitcodes/semantic-text-similarity/semantic_text_similarity/models/bert/bert_preprocessing.py", line 77, in bert_sentence_pair_preprocessing
    dataset_input_ids[idx] = torch.tensor(input_ids, dtype=torch.long)
RuntimeError: The expanded size of the tensor (11) must match the existing size (15) at non-singleton dimension 0.  Target sizes: [11].  Tensor sizes: [15]

Problem with space

When I compare the same words written with and without a space, the model shows low similarity for some of them:
Code: web_model.predict([('crm plus','crmplus')])
Output: 0.8329288

But for other words, such as web_model.predict([('iphone plus','iphoneplus')]), the output is 3.4955034 (high similarity).

printing out issue

from semantic_text_similarity.models import WebBertSimilarity
from semantic_text_similarity.models import ClinicalBertSimilarity

web_model = WebBertSimilarity(device='cpu', batch_size=10) #defaults to GPU prediction

clinical_model = ClinicalBertSimilarity(device='cuda', batch_size=10) #defaults to GPU prediction

web_model.predict([("She won an olympic gold medal","The women is an olympic champion")])`

So, how do I print the results? If I use print(web_model.predict([("She won an olympic gold medal","The women is an olympic champion")])),
it gives me 4.021563.

Citation

Hello,

thank you for sharing this; it saved me a lot of time and helped me a lot. I would like to reference it in my paper. Do you have a DOI for your GitHub repository, and maybe a BibTeX template?

Thanks!

Interpreting output

How should I interpret the output? I am confused by the continuous-valued scores.

from semantic_text_similarity.models import WebBertSimilarity
from semantic_text_similarity.models import ClinicalBertSimilarity
web_model = WebBertSimilarity(device='cpu', batch_size=10) #defaults to GPU prediction
clinical_model = ClinicalBertSimilarity(device='cuda', batch_size=10) #defaults to GPU prediction
web_model = WebBertSimilarity(device='cuda', batch_size=10) #defaults to GPU prediction
web_model.predict([("She won an olympic gold medal","The women is an olympic champion")])
array([3.0079894], dtype=float32)
web_model.predict([("I am King","I am king")])
array([4.5483], dtype=float32)
web_model.predict([("I am King","I am not king")])
array([3.6953335], dtype=float32)
web_model.predict([("I am King","I am queen")])
array([4.31337], dtype=float32)
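The scores follow the 0-5 STS scale the models were trained on (see the table at the top of the README): 0 roughly means the sentences are unrelated and 5 means they are essentially equivalent. If a normalized value or a yes/no decision is more convenient, here is a small post-processing sketch (the 0.5 cut-off is an arbitrary example, not a calibrated threshold):

# Continuing from the snippet above; scores are on the [0, 5] STS scale.
scores = web_model.predict([("I am King", "I am queen"),
                            ("I am King", "I am not king")])

normalized = scores / 5.0        # rescale to [0, 1]
is_similar = normalized >= 0.5   # arbitrary example cut-off; tune per task

print(normalized, is_similar)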

How to change the download path for model files on Windows 10?

C:\ykprj>python semSimi.py
Downloading model: web-bert-similarity from https://github.com/AndriyMulyar/semantic-text-similarity/releases/download/v1.0.0/web_bert_similarity.tar.gz
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 405359924/405359924 [2:55:39<00:00, 38460.08B/s]
Failed to download model: web-bert-similarity

  File "C:\Program Files\Python37\lib\gzip.py", line 168, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
PermissionError: [Errno 13] Permission denied: 'C**********\AppData\Local\Temp\tmpeiscmajk'

Performance on the Quora question-pairs dataset

I used this model on the Quora duplicate-questions dataset (http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv). The model's performance is below:
|                  | Model output = 0 | Model output = 1 |
|------------------|------------------|------------------|
| is_duplicate = 0 | 218,328          | 36,696           |
| is_duplicate = 1 | 72,739           | 76,524           |

Do you have any suggestions for improving the performance of the model?

Code is here:

from semantic_text_similarity.models import WebBertSimilarity
from semantic_text_similarity.models import ClinicalBertSimilarity
web_model = WebBertSimilarity(device='cuda', batch_size=10) #defaults to GPU prediction

web_model.predict([("She won an olympic gold medal","The women is an olympic champion")])

# Quora

import pandas as pd

def check_score(row):
    return web_model.predict([(row['question1'], row['question2'])])[0]

t2 = pd.read_csv("./quora_duplicate_questions.tsv", sep='\t')
t3 = t2.dropna()
t3['model_score'] = t3.apply(check_score, axis=1)
t3.to_csv("./t3_Jan10.csv", index=False)
t3 = pd.read_csv("./t3_Jan10.csv")
t3[t3.is_duplicate == 0]['model_score'].mean()
t3[t3.is_duplicate == 1]['model_score'].mean()
t3['model_output'] = 0
t3.loc[t3.model_score > 3.71, 'model_output'] = 1
pd.crosstab(t3.is_duplicate, t3.model_output)
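One way to summarize the confusion matrix above is with standard classification metrics; a short sketch using scikit-learn (assuming the t3 dataframe from the script above and that scikit-learn is installed). Sweeping the 3.71 cut-off on a held-out split, rather than fixing it, is also a low-effort way to trade precision against recall.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# t3.is_duplicate is the gold label, t3.model_output the thresholded prediction.
accuracy = accuracy_score(t3.is_duplicate, t3.model_output)
precision, recall, f1, _ = precision_recall_fscore_support(
    t3.is_duplicate, t3.model_output, average='binary')

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")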

Failed to download web-bert-similarity model

I am unable to download the web-bert-similarity model with both the old and the new version. Can you please help, @AndriyMulyar?

File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask\app.py", line 2091, in call
return self.wsgi_app(environ, start_response)
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask\app.py", line 2076, in wsgi_app
response = self.handle_exception(e)
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask_cors\extension.py", line 165, in wrapped_function
return cors_after_request(app.make_response(f(*args, **kwargs)))
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask\app.py", line 2073, in wsgi_app
response = self.full_dispatch_request()
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask\app.py", line 1518, in full_dispatch_request
rv = self.handle_user_exception(e)
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask_cors\extension.py", line 165, in wrapped_function
return cors_after_request(app.make_response(f(*args, **kwargs)))
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask\app.py", line 1516, in full_dispatch_request
rv = self.dispatch_request()
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask\app.py", line 1502, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask_cors\decorator.py", line 128, in wrapped_function
resp = make_response(f(*args, **kwargs))
File "D:\NLP Projects\semantic-text-similarity\clientApp.py", line 34, in predictRoute
webert_prediction = webbert_object.predict()
File "D:\NLP Projects\semantic-text-similarity\web_bert.py", line 10, in predict
model = WebBertSimilarity(device='cpu', batch_size=10)
File "D:\NLP Projects\semantic-text-similarity\semantic_text_similarity\models\bert\web_similarity.py", line 7, in init
model_path = get_model_path(model_name)
File "D:\NLP Projects\semantic-text-similarity\semantic_text_similarity\models\util.py", line 40, in get_model_path
raise exc
File "D:\NLP Projects\semantic-text-similarity\semantic_text_similarity\models\util.py", line 36, in get_model_path
tar = tarfile.open(temp_file.name)
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\tarfile.py", line 1573, in open
return func(name, "r", fileobj, **kwargs)
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\tarfile.py", line 1638, in gzopen
fileobj = gzip.GzipFile(name, mode + "b", compresslevel, fileobj)
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\gzip.py", line 163, in init
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
PermissionError: [Errno 13] Permission denied: 'C:\Users\Asif\AppData\Local\Temp\tmppi0i4vlj'

parameters

Hi, can you share all the parameters of the fine-tuned clinical model? Thank you!

Can't install semantic-text-similarity

Collecting semantic-text-similarity
Using cached semantic_text_similarity-1.0.3-py3-none-any.whl (416 kB)
Collecting pytorch-transformers==1.1.0
Using cached pytorch_transformers-1.1.0-py3-none-any.whl (158 kB)
ERROR: Command errored out with exit status 1:
command: 'C:\Users\vdharmalin\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\vdharmalin\AppData\Local\Temp\pip-install-cnw3ben9\torch\setup.py'"'"'; file='"'"'C:\Users\vdharmalin\AppData\Local\Temp\pip-install-cnw3ben9\torch\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\vdharmalin\AppData\Local\Temp\pip-wheel-c1gced_z'
cwd: C:\Users\vdharmalin\AppData\Local\Temp\pip-install-cnw3ben9\torch
Complete output (30 lines):
running bdist_wheel
running build
running build_deps
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\vdharmalin\AppData\Local\Temp\pip-install-cnw3ben9\torch\setup.py", line 225, in <module>
    setup(name="torch", version="0.1.2.post2",
  File "C:\Users\vdharmalin\Anaconda3\lib\site-packages\setuptools\__init__.py", line 153, in setup
    return distutils.core.setup(**attrs)
  File "C:\Users\vdharmalin\Anaconda3\lib\distutils\core.py", line 148, in setup
    dist.run_commands()
  File "C:\Users\vdharmalin\Anaconda3\lib\distutils\dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "C:\Users\vdharmalin\Anaconda3\lib\distutils\dist.py", line 985, in run_command
    cmd_obj.run()
  File "C:\Users\vdharmalin\Anaconda3\lib\site-packages\wheel\bdist_wheel.py", line 223, in run
    self.run_command('build')

Does it support multithreading and multiprocessing?

I tried to run this model against a dataset containing about 40k items. At the current rate, it would take more than 200 days. Is there any support for multithreading or multiprocessing? The goal is to reduce the time taken for each prediction.

I tried a few basic things, but CUDA ran out of memory along the way. Any help is much appreciated.
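Since predict already accepts a whole list of sentence pairs and batches them internally via batch_size, a per-row apply is usually the bottleneck rather than the model itself. A sketch of chunked GPU prediction (chunked_predict, the batch size of 32, and the chunk size of 1024 are illustrative choices, not part of the library; lower them if CUDA runs out of memory):

from semantic_text_similarity.models import WebBertSimilarity

web_model = WebBertSimilarity(device='cuda', batch_size=32)  # batch_size is a guess to tune

def chunked_predict(model, pairs, chunk_size=1024):
    """Score a long list of (sentence_1, sentence_2) pairs in memory-bounded chunks."""
    scores = []
    for start in range(0, len(pairs), chunk_size):
        scores.extend(model.predict(pairs[start:start + chunk_size]))
    return scores

# Example usage with a pandas dataframe of question pairs:
# pairs = list(zip(df['question1'], df['question2']))
# df['model_score'] = chunked_predict(web_model, pairs)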

Embeddings

I used your model to predict similarity on a variety of datasets and it worked well, but in some places I need only the word embeddings. Could you explain how to generate word embeddings with your implementation?
