
kosentencebert-skt's Issues

"TypeError: __init__() got an unexpected keyword argument 'return_dict'" error when loading the model

As described in the docs, I moved transformers, sentence-transformers, and tokenizers to the same path "opt/conda/lib/python3.7/site-packages/" and then ran the code, but the error below is printed.

Could you tell me what the cause might be?

The versions of the libraries I am currently using are listed below.

alabaster 0.7.12
anaconda-client 1.9.0
anaconda-navigator 2.1.1
anaconda-project 0.10.1
anyio 2.2.0
appdirs 1.4.4
applaunchservices 0.2.1
appnope 0.1.2
appscript 1.1.2
argh 0.26.2
argon2-cffi 20.1.0
arrow 0.13.1
asgiref 3.4.1
asn1crypto 1.4.0
astroid 2.6.6
astropy 4.3.1
async-generator 1.10
atomicwrites 1.4.0
attrs 21.2.0
autopep8 1.5.7
Babel 2.9.1
backcall 0.2.0
backports.functools-lru-cache 1.6.4
backports.shutil-get-terminal-size 1.0.0
backports.tempfile 1.0
backports.weakref 1.0.post1
beautifulsoup4 4.10.0
binaryornot 0.4.4
bitarray 2.3.0
bkcharts 0.2
black 19.10b0
bleach 4.0.0
bokeh 2.4.1
boto 2.49.0
Bottleneck 1.3.2
brotlipy 0.7.0
cached-property 1.5.2
certifi 2021.10.8
cffi 1.14.6
chardet 4.0.0
charset-normalizer 2.0.4
click 8.0.3
cloudpickle 2.0.0
clyent 1.2.2
colorama 0.4.4
conda 4.12.0
conda-build 3.21.5
conda-content-trust 0+unknown
conda-pack 0.6.0
conda-package-handling 1.7.3
conda-repo-cli 1.0.4
conda-token 0.3.0
conda-verify 3.4.2
contextlib2 0.6.0.post1
cookiecutter 1.7.2
cryptography 3.4.8
cycler 0.10.0
Cython 0.29.24
cytoolz 0.11.0
daal4py 2021.3.0
dask 2021.10.0
debugpy 1.4.1
decorator 5.1.0
defusedxml 0.7.1
diff-match-patch 20200713
distributed 2021.10.0
Django 1.8.13
django-cors-headers 2.2.0
django-extensions 1.6.7
django-tenants 1.1.7
djangorestframework 3.3.3
docutils 0.17.1
entrypoints 0.3
et-xmlfile 1.1.0
fastcache 1.1.0
filelock 3.3.1
flake8 3.9.2
Flask 1.1.2
fonttools 4.25.0
fsspec 2021.8.1
future 0.18.2
gevent 21.8.0
glob2 0.7
gmpy2 2.0.8
greenlet 1.1.1
h5py 3.2.1
HeapDict 1.0.1
html5lib 1.1
huggingface-hub 0.4.0
idna 3.2
imagecodecs 2021.8.26
imageio 2.9.0
imagesize 1.2.0
importlib-metadata 4.8.1
inflection 0.5.1
iniconfig 1.1.1
intervaltree 3.1.0
ipykernel 6.4.1
ipython 7.29.0
ipython-genutils 0.2.0
ipywidgets 7.6.5
isort 5.9.3
itsdangerous 2.0.1
jdcal 1.4.1
jedi 0.18.0
Jinja2 2.11.3
jinja2-time 0.2.0
joblib 1.1.0
json5 0.9.6
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 6.1.12
jupyter-console 6.4.0
jupyter-core 4.8.1
jupyter-server 1.4.1
jupyterlab 3.2.1
jupyterlab-pygments 0.1.2
jupyterlab-server 2.8.2
jupyterlab-widgets 1.0.0
keyring 23.1.0
kiwisolver 1.3.1
lazy-object-proxy 1.6.0
libarchive-c 2.9
libpysal 4.6.2
llvmlite 0.37.0
locket 0.2.1
lxml 4.6.3
MarkupSafe 1.1.1
matplotlib 3.4.3
matplotlib-inline 0.1.2
mccabe 0.6.1
mistune 0.8.4
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
mock 4.0.3
more-itertools 8.10.0
mpmath 1.2.1
msgpack 1.0.2
multipledispatch 0.6.0
munkres 1.1.4
mypy-extensions 0.4.3
navigator-updater 0.2.1
nbclassic 0.2.6
nbclient 0.5.3
nbconvert 6.1.0
nbformat 5.1.3
nest-asyncio 1.5.1
networkx 2.6.3
nltk 3.6.5
nose 1.3.7
notebook 6.4.5
numba 0.54.1
numexpr 2.7.3
numpy 1.20.3
numpydoc 1.1.0
olefile 0.46
openpyxl 3.0.9
packaging 21.0
pandas 1.3.4
pandocfilters 1.4.3
parso 0.8.2
partd 1.2.0
path 16.0.0
pathlib2 2.3.6
pathspec 0.7.0
patsy 0.5.2
pep8 1.7.1
pexpect 4.8.0
pickleshare 0.7.5
Pillow 8.4.0
pip 21.2.4
pkginfo 1.7.1
pluggy 0.13.1
ply 3.11
poyo 0.5.0
prometheus-client 0.11.0
prompt-toolkit 3.0.20
psutil 5.8.0
psycopg2 2.8.6
ptyprocess 0.7.0
py 1.10.0
pycodestyle 2.7.0
pycosat 0.6.3
pycparser 2.20
pycurl 7.44.1
pydocstyle 6.1.1
pyerfa 2.0.0
pyflakes 2.3.1
Pygments 2.10.0
PyJWT 2.1.0
pylint 2.9.6
pyls-spyder 0.4.0
pyodbc 4.0.0-unsupported
pyOpenSSL 21.0.0
pyparsing 3.0.4
pyrsistent 0.18.0
PySocks 1.7.1
pytest 6.2.4
python-dateutil 2.8.2
python-lsp-black 1.0.0
python-lsp-jsonrpc 1.0.0
python-lsp-server 1.2.4
python-memcached 1.59
python-slugify 5.0.2
pytz 2021.3
PyWavelets 1.1.1
PyYAML 6.0
pyzmq 22.2.1
QDarkStyle 3.0.2
qstylizer 0.1.10
QtAwesome 1.0.2
qtconsole 5.1.1
QtPy 1.10.0
regex 2021.8.3
requests 2.26.0
rope 0.19.0
Rtree 0.9.7
ruamel-yaml-conda 0.15.100
sacremoses 0.0.47
scikit-image 0.18.3
scikit-learn 1.0.2
scikit-learn-intelex 2021.20210714.100439
scipy 1.7.3
seaborn 0.11.2
Send2Trash 1.8.0
sentence-transformers 2.1.0
sentencepiece 0.1.96
setuptools 58.0.4
simplegeneric 0.8.1
singledispatch 3.7.0
sip 4.19.13
six 1.16.0
sklearn 0.0
sniffio 1.2.0
snowballstemmer 2.1.0
sortedcollections 2.1.0
sortedcontainers 2.4.0
soupsieve 2.2.1
Sphinx 4.2.0
sphinxcontrib-applehelp 1.0.2
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 2.0.0
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.5
sphinxcontrib-websupport 1.2.4
spyder 5.1.5
spyder-kernels 2.1.3
SQLAlchemy 1.4.22
sqlparse 0.4.1
statsmodels 0.12.2
sympy 1.9
tables 3.6.1
TBB 0.2
tblib 1.7.0
terminado 0.9.4
testpath 0.5.0
text-unidecode 1.3
textdistance 4.2.1
threadpoolctl 2.2.0
three-merge 0.1.1
tifffile 2021.7.2
tinycss 0.4
tokenizers 0.11.6
toml 0.10.2
toolz 0.11.1
torch 1.10.2
torchvision 0.11.3
tornado 6.1
tqdm 4.62.3
traitlets 5.1.0
transformers 4.16.2
typed-ast 1.4.3
typing-extensions 3.10.0.2
ujson 4.0.2
unicodecsv 0.14.1
Unidecode 1.2.0
urllib3 1.26.7
watchdog 2.1.3
wcwidth 0.2.5
webencodings 0.5.1
Werkzeug 2.0.2
wheel 0.37.0
whichcraft 0.6.1
widgetsnbextension 3.5.1
wrapt 1.12.1
wurlitzer 2.1.1
xlrd 2.0.1
XlsxWriter 3.0.1
xlwings 0.24.9
xlwt 1.3.0
xmltodict 0.12.0
yapf 0.31.0
zict 2.0.0
zipp 3.6.0
zope.event 4.5.0
zope.interface 5.4.0

  • Error output
    using cached model. /workspace/opt/conda/lib/python3.7/site-packages/.cache/kobert_v1.zip
    using cached model. /workspace/opt/conda/lib/python3.7/site-packages/.cache/kobert_news_wiki_ko_cased-1087f8699e.spiece

TypeError                                 Traceback (most recent call last)
in
     26
     27 model_path = '/workspace/DBP/data_storage/wontae_kim/Pre_Trained_Model/KoBERT/KoSentenceBERT_SKTBERT/output/training_sts/'
---> 28 embedder = SentenceTransformer(model_path)

/workspace/opt/sentence_transformers/SentenceTransformer.py in __init__(self, model_name_or_path, modules, device)

/workspace/opt/sentence_transformers/models/Transformer.py in load(input_path)

/workspace/opt/sentence_transformers/models/Transformer.py in __init__(self, model_name_or_path, max_seq_length, model_args, cache_dir, tokenizer_args, isKor, isLoad)

/workspace/opt/kobert/pytorch_kobert.py in get_pytorch_kobert_model(ctx, cachedir)

/workspace/opt/kobert/pytorch_kobert.py in get_kobert_model(model_path, vocab_file, ctx)

/workspace/opt/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    510
    511         # Instantiate model.
--> 512         model = cls(config, *model_args, **model_kwargs)
    513
    514         if state_dict is None and not from_tf:

TypeError: __init__() got an unexpected keyword argument 'return_dict'
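
For what it is worth, the traceback suggests the transformers copy actually being imported is an old one: its from_pretrained forwards unknown keyword arguments straight into BertModel.__init__ (the "model = cls(config, *model_args, **model_kwargs)" line above), so a return_dict argument passed in by the newer kobert loader fails there. A minimal, repo-agnostic sketch to confirm which transformers copy and version Python is picking up:

import inspect
import transformers
from transformers import BertModel

# Diagnostic sketch: with several copies of transformers around (the pip-installed
# 4.16.2 vs. the folder moved into site-packages), check which one gets imported.
print('transformers version:', transformers.__version__)
print('imported from:', transformers.__file__)
# BertModel.__init__ never takes return_dict; old from_pretrained implementations
# forward it to the constructor instead of absorbing it into the config.
print('BertModel.__init__ signature:', inspect.signature(BertModel.__init__))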

FileNotFoundError: [Errno 2] No such file or directory

Hello.
I would like to use the STS pre-trained model described in the README.

However, I get FileNotFoundError: [Errno 2] No such file or directory: './KoSentenceBERT_SKT/output/training_sts/0_Transformer/result.pt'.

The output/training_sts directory on GitHub does not contain a 0_Transformer folder.
Is it enough to simply create the 0_Transformer folder?
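
As a quick self-check before loading, one could list what is actually on disk at that path. This is only a sketch: result.pt is the file named in the error above, and sentence_xlnet_config.json is taken from another issue further down this page, so the exact set of required files is an assumption rather than the loader's authoritative list.

import os

# Sketch: check which of the files mentioned in this thread are present.
model_dir = './KoSentenceBERT_SKT/output/training_sts/0_Transformer'
expected = ['result.pt', 'sentence_xlnet_config.json']  # assumption based on this page

print('directory exists:', os.path.isdir(model_dir))
if os.path.isdir(model_dir):
    print('contents:', os.listdir(model_dir))
for name in expected:
    print(name, 'found:', os.path.isfile(os.path.join(model_dir, name)))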

tokenizers __init__

I tried to clone and run your code, but my installed transformers/tokenizers versions did not provide the AddedToken class, so KoSentenceBERT_SKTBERT/tokenizers/__init__.py raised errors.
I solved it by reinstalling transformers and tokenizers and adding AddedToken and the related definitions directly to the __init__.py file, as below:

re-installing transformers and tokenizers:

!pip uninstall transformers
!pip install transformers
!pip uninstall tokenizers
!pip install tokenizers

Re-installed package versions:

Name: tokenizers
Version: 0.9.4

Name: transformers
Version: 4.1.1

Adding AddedToken and the related definitions directly to the __init__.py file:

__version__ = "0.5.2"

from dataclasses import dataclass, field

try:
    import tokenizers
    _tokenizers_available = True
except ImportError:
    _tokenizers_available = False

def is_tokenizers_available():
    return _tokenizers_available


@dataclass(frozen=True, eq=True)
class AddedToken:
    """
    AddedToken represents a token to be added to a Tokenizer An AddedToken can have special options defining the
    way it should behave.
    """

    content: str = field(default_factory=str)
    single_word: bool = False
    lstrip: bool = False
    rstrip: bool = False
    normalized: bool = True

    def __getstate__(self):
        return self.__dict__

@dataclass
class EncodingFast:
    """ This is dummy class because without the `tokenizers` library we don't have these objects anyway """

    pass



from .tokenizers import Tokenizer, Encoding

same as original:

from .tokenizers import decoders
from .tokenizers import models
from .tokenizers import normalizers
from .tokenizers import pre_tokenizers
from .tokenizers import processors
from .tokenizers import trainers
from .implementations import (
    ByteLevelBPETokenizer,
    CharBPETokenizer,
    SentencePieceBPETokenizer,
    BertWordPieceTokenizer,
)
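
After applying the patch, a quick sanity check could confirm that the tokenizers copy Python resolves is the patched one and that it exposes the names this repo imports. A minimal sketch (it assumes the patched copy is the one found on sys.path):

import tokenizers
from tokenizers import AddedToken, Tokenizer

# Sanity-check sketch for the patched __init__.py: confirm which copy was
# imported and that the names the repo code expects are present.
print(tokenizers.__file__)
print(AddedToken(content="[NEW_TOKEN]"))   # hypothetical token, only for the check
print('Tokenizer class:', Tokenizer)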

sentence_xlnet_config.json error

Hello,
while running the code you provided, I get an error like
No such file or directory: './KoSentenceBERT_SKTBERT/output/training_sts/0_Transformer/sentence_xlnet_config.json'

How is the sentence_xlnet_config.json file generated?
I am also wondering whether there are other files that need to be present in 0_Transformer.
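
One way to answer this without guessing could be to read the bundled loader itself, since that is the code that opens the json file. A sketch, assuming the repo's copied package imports as sentence_transformers and that its Transformer.load is the loader shown in the traceback of the first issue:

import inspect
from sentence_transformers.models import Transformer

# Sketch: print the source of the bundled Transformer.load to see which config
# filename it opens and which other files it reads from 0_Transformer.
print(inspect.getsource(Transformer.load))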

Using the pretrained model on CPU

embedder = SentenceTransformer(model_path)

When I load the model here,

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

I get this error. Since I don't have a GPU, I would like to use the pretrained model on CPU. Is that possible?
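
A sketch of the usual workaround, not verified against this repo: the bundled SentenceTransformer.__init__ signature shown in the traceback of the first issue takes a device argument, and any torch.load call that still fails would need map_location.

from sentence_transformers import SentenceTransformer

# Sketch for a CPU-only machine. Assumptions: the bundled SentenceTransformer
# accepts device= (its __init__(self, model_name_or_path, modules, device)
# signature suggests so), and model_path points at the downloaded STS model.
model_path = './KoSentenceBERT_SKT/output/training_sts/'
embedder = SentenceTransformer(model_path, device='cpu')

# If the loader still calls torch.load without map_location internally,
# that internal call would need to become something like:
#   state = torch.load(weights_path, map_location=torch.device('cpu'))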

SemanticSearch.py tokenizers.tokenizers error + sharing my troubleshooting process

Hello, and thank you for sharing this nice code.

The reason I am writing is that I get a tokenizers.tokenizers error when running SemanticSearch.py.
Following the issues other people posted, I modified the tokenizers __init__.py, and I also tried skipping the transformers==2.8.0 install from requirements.txt and only copying the transformers folder into site-packages, but the problem was not resolved, so I am opening this issue.

  • OS: Windows 10
  • Anaconda
  • Python 3.7

I followed every step exactly as written and moved sentence_transformers, tokenizers, and transformers into site-packages.
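
In case it helps narrow this down, a minimal check of which tokenizers copy Python resolves and whether its compiled tokenizers.tokenizers extension imports (nothing here is specific to this repo):

import importlib.util

# Diagnostic sketch: a tokenizers.tokenizers error usually means the copy of
# tokenizers being imported has no compiled extension module next to it.
spec = importlib.util.find_spec("tokenizers")
print("tokenizers resolved from:", spec.origin if spec else None)

try:
    from tokenizers import Tokenizer  # re-exported from the compiled tokenizers.tokenizers module
    print("tokenizers.tokenizers imported OK")
except ImportError as exc:
    print("compiled extension missing or shadowed:", exc)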
