
semantic-text-similarity's Introduction

semantic-text-similarity

an easy-to-use interface to fine-tuned BERT models for computing semantic similarity. that's it.

This project contains an interface to fine-tuned, BERT-based semantic text similarity models. It modifies pytorch-transformers by abstracting away all the research benchmarking code, making the models easier to apply in real-world settings.

| Model             | Dataset | Dev. Correlation |
|-------------------|---------|------------------|
| Web STS BERT      | STS-B   | 0.893            |
| Clinical STS BERT | MED-STS | 0.854            |

Installation

Install with pip:

pip install semantic-text-similarity

or directly:

pip install git+https://github.com/AndriyMulyar/semantic-text-similarity

Use

The models map batches of sentence pairs to real-valued similarity scores in the range [0, 5].

from semantic_text_similarity.models import WebBertSimilarity
from semantic_text_similarity.models import ClinicalBertSimilarity

web_model = WebBertSimilarity(device='cpu', batch_size=10)  # device defaults to GPU prediction if not specified

clinical_model = ClinicalBertSimilarity(device='cuda', batch_size=10)  # device defaults to GPU prediction if not specified

web_model.predict([("She won an olympic gold medal","The women is an olympic champion")])

More examples.

Notes

  • You will need a GPU if you want any reasonable prediction speed from these models.
  • Model downloads are cached in ~/.cache/torch/semantic_text_similarity/. Try clearing this folder if you have issues.

Acknowledgement

Clinical models in this project were submitted to the 2019 N2C2 Shared Task Track 1. Implementation and model training in this project were supported by funding from the Mark Dredze Lab at Johns Hopkins University.

semantic-text-similarity's People

Contributors

andriymulyar


semantic-text-similarity's Issues

dataset

Hi,

I attended the 2019 n2c2 Task 3 and found your model useful on my data. I am continuing to work on Task 3, so could you tell me how to obtain the MedSTS dataset? Thank you very much!

Wen.

PermissionError: [Errno 13] Permission denied:

Hello,

When I launch your example, the program begins by downloading web_bert_similarity.tar.gz, and at the end of the download I get the following error:
Failed to download model: web-bert-similarity
Traceback (most recent call last):
  File "semantic_similarity.py", line 4, in <module>
    web_model = WebBertSimilarity(device='cpu', batch_size=10) #defaults to GPU prediction
  File "F:\Users\rachidj\Anaconda3\envs\nlpenv\lib\site-packages\semantic_text_similarity\models\bert\web_similarity.py", line 7, in __init__
    model_path = get_model_path(model_name)
  File "F:\Users\rachidj\Anaconda3\envs\nlpenv\lib\site-packages\semantic_text_similarity\models\util.py", line 40, in get_model_path
    raise exc
  File "F:\Users\rachidj\Anaconda3\envs\nlpenv\lib\site-packages\semantic_text_similarity\models\util.py", line 36, in get_model_path
    tar = tarfile.open(temp_file.name)
  File "F:\Users\rachidj\Anaconda3\envs\nlpenv\lib\tarfile.py", line 1573, in open
    return func(name, "r", fileobj, **kwargs)
  File "F:\Users\rachidj\Anaconda3\envs\nlpenv\lib\tarfile.py", line 1638, in gzopen
    fileobj = gzip.GzipFile(name, mode + "b", compresslevel, fileobj)
  File "F:\Users\rachidj\Anaconda3\envs\nlpenv\lib\gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\rachidj\\AppData\\Local\\Temp\\tmpq6i6wt_s'

So I downloaded the files manually, but I don't know where to put them to make the example script work.
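For readers hitting the same permission error, one possible workaround is to extract the manually downloaded archive into the cache directory mentioned in the Notes section above. The exact folder layout the loader expects is an assumption here, so treat this as a sketch rather than a confirmed fix:

import os
import tarfile

# Assumption: the library resolves models from this cache directory
# (see the Notes section of the README); adjust paths for your own system.
cache_dir = os.path.expanduser("~/.cache/torch/semantic_text_similarity/")
os.makedirs(cache_dir, exist_ok=True)

# Extract the manually downloaded archive into the cache directory.
with tarfile.open("web_bert_similarity.tar.gz", "r:gz") as archive:
    archive.extractall(cache_dir)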

Loading of WebBertSimilarity model breaks

Hello,

While using this repo, I noticed that the web-bert-similarity model fails to load. After some digging, I realized that the authors of hugging-face/pytorch-transformers have released a new version, 1.2.0, which breaks your code at line 27 of /models/bert/similarity.py.
I was able to fix it by pinning the older version of pytorch-transformers (1.1.0) and thought I should give you a heads-up in case you weren't aware.

Thanks and good luck!

problem when swapped sentences

I load the model and predict the similarity between sentence A and sentence B. When I swap the order of the two sentences, I get a different similarity value. Why does this happen? Isn't this measure supposed to be symmetric?
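Because the fine-tuned BERT model reads the two sentences as one ordered sequence, its output is not guaranteed to be symmetric. If a symmetric score is needed, one possible workaround (not part of the library's API) is to average the two orderings:

def symmetric_similarity(model, sentence_a, sentence_b):
    """Average the scores of both orderings to get an order-independent value.

    `model` is a WebBertSimilarity or ClinicalBertSimilarity instance; this
    helper is an illustrative workaround, not part of the package.
    """
    forward, backward = model.predict([(sentence_a, sentence_b),
                                       (sentence_b, sentence_a)])
    return (forward + backward) / 2.0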

Is BERT actually good for semantic similarity? Check the example

According to the scores below, the second pair is rated as more semantically similar than the first, but in reality it should be the opposite.

>>> model.predict([("he is an indian", "he has indian citizenship")])
array([3.2054904], dtype=float32)
>>> model.predict([("he is an indian", "he is  not an indian")])
array([3.590286], dtype=float32)

RuntimeError: The expanded size of the tensor (11) must match the existing size (15) at non-singleton dimension 0. Target sizes: [11]. Tensor sizes: [15]

>>> model.predict([("he lives in america","he is an indian but stays in america")])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/gitcodes/semantic-text-similarity/semantic_text_similarity/models/bert/similarity.py", line 111, in predict
    [{'sentence_1': s1, 'sentence_2':s2} for s1, s2 in data], self.bert_tokenizer)
  File "/data/gitcodes/semantic-text-similarity/semantic_text_similarity/models/bert/bert_preprocessing.py", line 77, in bert_sentence_pair_preprocessing
    dataset_input_ids[idx] = torch.tensor(input_ids, dtype=torch.long)
RuntimeError: The expanded size of the tensor (11) must match the existing size (15) at non-singleton dimension 0.  Target sizes: [11].  Tensor sizes: [15]

Problem with space

When I compare the same words written with and without a space, the model shows low similarity for some of them:
Code: web_model.predict([('crm plus','crmplus')])
Output: 0.8329288

But for other words, such as web_model.predict([('iphone plus','iphoneplus')]), the output is 3.4955034 (high similarity).

printing out issue

from semantic_text_similarity.models import WebBertSimilarity
from semantic_text_similarity.models import ClinicalBertSimilarity

web_model = WebBertSimilarity(device='cpu', batch_size=10) #defaults to GPU prediction

clinical_model = ClinicalBertSimilarity(device='cuda', batch_size=10) #defaults to GPU prediction

web_model.predict([("She won an olympic gold medal","The women is an olympic champion")])`

So, how do I print the results? If I use print(web_model.predict([("She won an olympic gold medal","The women is an olympic champion")])),
it gives me 4.021563.

Citation

Hello,

thank you for sharing this; it saved me a lot of time and helped me a lot. I would like to reference it in my paper. Do you have a DOI for your GitHub repository, and maybe a BibTeX template?

Thanks!

Interpreting output

How should I interpret the output? I am confused by the continuous-valued scores.

from semantic_text_similarity.models import WebBertSimilarity
from semantic_text_similarity.models import ClinicalBertSimilarity
web_model = WebBertSimilarity(device='cpu', batch_size=10) #defaults to GPU prediction
clinical_model = ClinicalBertSimilarity(device='cuda', batch_size=10) #defaults to GPU prediction
web_model = WebBertSimilarity(device='cuda', batch_size=10) #defaults to GPU prediction
web_model.predict([("She won an olympic gold medal","The women is an olympic champion")])
array([3.0079894], dtype=float32)
web_model.predict([("I am King","I am king")])
array([4.5483], dtype=float32)
web_model.predict([("I am King","I am not king")])
array([3.6953335], dtype=float32)
web_model.predict([("I am King","I am queen")])
array([4.31337], dtype=float32)
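The scores follow the 0-5 STS scale the models were trained on (see the table at the top of the README): 0 roughly means the sentences are unrelated and 5 means they are essentially equivalent. If a normalized value or a yes/no decision is more convenient, here is a small post-processing sketch (the 0.5 cut-off is an arbitrary example, not a calibrated threshold):

# Continuing from the snippet above; scores are on the [0, 5] STS scale.
scores = web_model.predict([("I am King", "I am queen"),
                            ("I am King", "I am not king")])

normalized = scores / 5.0        # rescale to [0, 1]
is_similar = normalized >= 0.5   # arbitrary example cut-off; tune per task

print(normalized, is_similar)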

How to change the download path for model files on Windows 10?

C:\ykprj>python semSimi.py
Downloading model: web-bert-similarity from https://github.com/AndriyMulyar/semantic-text-similarity/releases/download/v1.0.0/web_bert_similarity.tar.gz
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 405359924/405359924 [2:55:39<00:00, 38460.08B/s]
Failed to download model: web-bert-similarity

  File "C:\Program Files\Python37\lib\gzip.py", line 168, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
PermissionError: [Errno 13] Permission denied: 'C**********\AppData\Local\Temp\tmpeiscmajk'

Performance on the Quora question-pairs dataset

I used this model on the Quora duplicate-questions dataset (http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv). The model's performance is below:
|                  | Model output = 0 | Model output = 1 |
|------------------|------------------|------------------|
| is_duplicate = 0 | 218,328          | 36,696           |
| is_duplicate = 1 | 72,739           | 76,524           |

Do you have any suggestions for improving the performance of the model?

Code is here:

from semantic_text_similarity.models import WebBertSimilarity
from semantic_text_similarity.models import ClinicalBertSimilarity
web_model = WebBertSimilarity(device='cuda', batch_size=10) #defaults to GPU prediction

web_model.predict([("She won an olympic gold medal","The women is an olympic champion")])

# Quora

import pandas as pd

def check_score(row):
    return web_model.predict([(row['question1'], row['question2'])])[0]

t2 = pd.read_csv("./quora_duplicate_questions.tsv", sep='\t')
t3 = t2.dropna()
t3['model_score'] = t3.apply(check_score, axis=1)
t3.to_csv("./t3_Jan10.csv", index=False)
t3 = pd.read_csv("./t3_Jan10.csv")
t3[t3.is_duplicate == 0]['model_score'].mean()
t3[t3.is_duplicate == 1]['model_score'].mean()
t3['model_output'] = 0
t3.loc[t3.model_score > 3.71, 'model_output'] = 1
pd.crosstab(t3.is_duplicate, t3.model_output)
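One way to summarize the confusion matrix above is with standard classification metrics; a short sketch using scikit-learn (assuming the t3 dataframe from the script above and that scikit-learn is installed). Sweeping the 3.71 cut-off on a held-out split, rather than fixing it, is also a low-effort way to trade precision against recall.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# t3.is_duplicate is the gold label, t3.model_output the thresholded prediction.
accuracy = accuracy_score(t3.is_duplicate, t3.model_output)
precision, recall, f1, _ = precision_recall_fscore_support(
    t3.is_duplicate, t3.model_output, average='binary')

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")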

Failed to download web-bert-similarity model

I am unable to download the web-bert-similarity model with both the old and the new version. Can you please help, @AndriyMulyar?

File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask\app.py", line 2091, in call
return self.wsgi_app(environ, start_response)
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask\app.py", line 2076, in wsgi_app
response = self.handle_exception(e)
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask_cors\extension.py", line 165, in wrapped_function
return cors_after_request(app.make_response(f(*args, **kwargs)))
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask\app.py", line 2073, in wsgi_app
response = self.full_dispatch_request()
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask\app.py", line 1518, in full_dispatch_request
rv = self.handle_user_exception(e)
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask_cors\extension.py", line 165, in wrapped_function
return cors_after_request(app.make_response(f(*args, **kwargs)))
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask\app.py", line 1516, in full_dispatch_request
rv = self.dispatch_request()
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask\app.py", line 1502, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\site-packages\flask_cors\decorator.py", line 128, in wrapped_function
resp = make_response(f(*args, **kwargs))
File "D:\NLP Projects\semantic-text-similarity\clientApp.py", line 34, in predictRoute
webert_prediction = webbert_object.predict()
File "D:\NLP Projects\semantic-text-similarity\web_bert.py", line 10, in predict
model = WebBertSimilarity(device='cpu', batch_size=10)
File "D:\NLP Projects\semantic-text-similarity\semantic_text_similarity\models\bert\web_similarity.py", line 7, in init
model_path = get_model_path(model_name)
File "D:\NLP Projects\semantic-text-similarity\semantic_text_similarity\models\util.py", line 40, in get_model_path
raise exc
File "D:\NLP Projects\semantic-text-similarity\semantic_text_similarity\models\util.py", line 36, in get_model_path
tar = tarfile.open(temp_file.name)
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\tarfile.py", line 1573, in open
return func(name, "r", fileobj, **kwargs)
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\tarfile.py", line 1638, in gzopen
fileobj = gzip.GzipFile(name, mode + "b", compresslevel, fileobj)
File "C:\Users\Asif\anaconda3\envs\semantic-text-similarity\lib\gzip.py", line 163, in init
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
PermissionError: [Errno 13] Permission denied: 'C:\Users\Asif\AppData\Local\Temp\tmppi0i4vlj'

parameters

Hi, can you share all the parameters of the fine-tuned clinical model? Thank you!

Can't install semantic-text-similarity

Collecting semantic-text-similarity
Using cached semantic_text_similarity-1.0.3-py3-none-any.whl (416 kB)
Collecting pytorch-transformers==1.1.0
Using cached pytorch_transformers-1.1.0-py3-none-any.whl (158 kB)
ERROR: Command errored out with exit status 1:
command: 'C:\Users\vdharmalin\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\vdharmalin\AppData\Local\Temp\pip-install-cnw3ben9\torch\setup.py'"'"'; file='"'"'C:\Users\vdharmalin\AppData\Local\Temp\pip-install-cnw3ben9\torch\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\vdharmalin\AppData\Local\Temp\pip-wheel-c1gced_z'
cwd: C:\Users\vdharmalin\AppData\Local\Temp\pip-install-cnw3ben9\torch
Complete output (30 lines):
running bdist_wheel
running build
running build_deps
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\vdharmalin\AppData\Local\Temp\pip-install-cnw3ben9\torch\setup.py", line 225, in <module>
    setup(name="torch", version="0.1.2.post2",
  File "C:\Users\vdharmalin\Anaconda3\lib\site-packages\setuptools\__init__.py", line 153, in setup
    return distutils.core.setup(**attrs)
  File "C:\Users\vdharmalin\Anaconda3\lib\distutils\core.py", line 148, in setup
    dist.run_commands()
  File "C:\Users\vdharmalin\Anaconda3\lib\distutils\dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "C:\Users\vdharmalin\Anaconda3\lib\distutils\dist.py", line 985, in run_command
    cmd_obj.run()
  File "C:\Users\vdharmalin\Anaconda3\lib\site-packages\wheel\bdist_wheel.py", line 223, in run
    self.run_command('build')

Does it support multithreading and multiprocessing?

I tried to run this model against a dataset containing about 40k items. At the current rate, it would take more than 200 days. Is there any support for multithreading or multiprocessing? The goal is to reduce the time taken for each prediction.

I tried a few basic things, but CUDA ran out of memory along the way. Any help is much appreciated.
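Since predict already accepts a whole list of sentence pairs and batches them internally via batch_size, a per-row apply is usually the bottleneck rather than the model itself. A sketch of chunked GPU prediction (chunked_predict, the batch size of 32, and the chunk size of 1024 are illustrative choices, not part of the library; lower them if CUDA runs out of memory):

from semantic_text_similarity.models import WebBertSimilarity

web_model = WebBertSimilarity(device='cuda', batch_size=32)  # batch_size is a guess to tune

def chunked_predict(model, pairs, chunk_size=1024):
    """Score a long list of (sentence_1, sentence_2) pairs in memory-bounded chunks."""
    scores = []
    for start in range(0, len(pairs), chunk_size):
        scores.extend(model.predict(pairs[start:start + chunk_size]))
    return scores

# Example usage with a pandas dataframe of question pairs:
# pairs = list(zip(df['question1'], df['question2']))
# df['model_score'] = chunked_predict(web_model, pairs)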

Embeddings

I used your model to predict similarity on a variety of datasets and it worked well, but in some places I need only the word embeddings. Could you explain how to generate word embeddings with your implementation?
