centre-for-humanities-computing / dacy

DaCy: The State of the Art Danish NLP pipeline using SpaCy
Home Page: https://centre-for-humanities-computing.github.io/DaCy/
License: Apache License 2.0
Currently, there are no CI tests for the tutorials.
The tutorial "😎 Wrapping a fine-tuned Tranformer" in the tutorials section is missing an "s" ("Tranformer" should be "Transformer").
Currently, the POS tag resides in the .tag_ attribute instead of .pos_, where it should (also) be.
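A minimal sketch of the current behaviour (the model name is taken from the load calls elsewhere in these issues):

import dacy

nlp = dacy.load("da_dacy_medium_trf-0.1.0")
doc = nlp("Hej verden.")
for token in doc:
    # the POS tag currently only shows up in .tag_; .pos_ stays empty
    print(token.text, token.tag_, token.pos_)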
Use Sphinx for the documentation, potentially together with:
https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring
Running from datasets import load_dataset gives me an error :-)
Currently, the model includes no morphologizer. This can easily be imported from one of the other Danish models.
For instance, using the spaCy implementation: https://github.com/TakeLab/spacy-udpipe
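A minimal sketch of one option, sourcing the trained component from an existing Danish spaCy pipeline (assuming da_core_news_sm is installed; the tok2vec/listener wiring would need checking):

import spacy
import dacy

nlp = dacy.load("da_dacy_medium_trf-0.1.0")
# any Danish pipeline with a trained morphologizer works as a source
source_nlp = spacy.load("da_core_news_sm")
# copy the trained morphologizer into the DaCy pipeline
nlp.add_pipe("morphologizer", source=source_nlp)

doc = nlp("Hej verden.")
print([(t.text, str(t.morph)) for t in doc])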
Examine a potential discrepancy between the spaCy dependency parse and the DaCy dependency parse, as noted by @rdkm89.
Add word frequencies estimated from Danish Gigaword as a language resource.
and URLs
spacy-transformers changed its API going from 1.0.x to 1.1.x, and the classification wrapper will need to be updated accordingly.
Add the DaNLP and Twitter coref. models.
LIT is a great tool for exploring model predictions. I would love to add a tutorial on using LIT with DaCy.
This is going to be great -> 🎉
(might be doable simply with MLM)
The new spaCy project for augmenters, augmenty, is a much more suitable place for the augmenters than DaCy. The augmenters will thus be moved there instead.
>>> nlp = dacy.load("da_dacy_large_trf-0.1.0")
Defaulting to user installation because normal site-packages is not writeable
Collecting da-dacy-large-trf==any
ERROR: HTTP error 404 while getting https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl
ERROR: Could not install requirement da-dacy-large-trf==any from https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl because of HTTP error 404 Client Error: Not Found for url: https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl for URL https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl
I note that the subdirectory part "da_dacy_medium_trf" does not correspond to da_dacy_large_trf-any-py3-none-any.whl
Changing the subdirectory in the URL:
python3 -m pip install https://huggingface.co/chcaa/da_dacy_large_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl
seems to work:
python3
>>> import dacy
>>> nlp = dacy.load("da_dacy_large_trf-0.1.0")
>>> doc = nlp('Hej verden.')
/usr/local/lib/python3.8/dist-packages/spacy/pipeline/attributeruler.py:108: UserWarning: [W036] The component 'matcher' does not have any patterns defined.
matches = self.matcher(doc, allow_missing=True)
>>> doc._.trf_data.tensors[-1].shape
(1, 1024)
After removing the readability measures, it would be nice to have a tutorial on: "Extracting text statistics and readability metrics using DaCy and Textdescriptives"
Potentially using the packages to examine the difference in language complexity between conversational data and legal documents in DAGW, or a similar task using a publicly available dataset. A starting point is sketched below.
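A minimal sketch of what the tutorial could start from, assuming the textdescriptives 2.x component naming:

import dacy
import textdescriptives as td

nlp = dacy.load("da_dacy_medium_trf-0.1.0")
# add the readability component from textdescriptives (2.x naming)
nlp.add_pipe("textdescriptives/readability")

doc = nlp("Dette er en kort dansk tekst, som vi gerne vil beskrive.")
# pull the computed metrics out as a pandas DataFrame
print(td.extract_df(doc))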
It should be possible using spacy-wrap 1.2.0.
I would like to configure the cache dir with an environment variable instead of using the default ~/.dacy. This is a relatively simple change, but it is useful if you want the cache to be in a known place (e.g. a specific folder in a Docker image). I created a PR for this feature (#67). Some projects like to have associated issues, so I am creating an issue as well.
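A minimal sketch of the proposed lookup; DACY_CACHE_DIR is a hypothetical variable name (the actual name is whatever PR #67 defines):

import os
from pathlib import Path

# fall back to the current default when the variable is not set
DEFAULT_CACHE_DIR = Path.home() / ".dacy"
cache_dir = Path(os.environ.get("DACY_CACHE_DIR", str(DEFAULT_CACHE_DIR)))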
Add a community discussion board.
When running:
import dacy

for model in dacy.models():
    print(model)

dacy_nlp = dacy.load('medium')
doc = dacy_nlp("DaCy er en hurtig og effektiv pipeline til dansk sprogprocessering bygget i SpaCy.")
print('hej')
I get the following output:
da_dacy_small_tft-0.0.0
da_dacy_medium_tft-0.0.0
da_dacy_large_tft-0.0.0
da_dacy_small_trf-0.1.0
da_dacy_medium_trf-0.1.0
da_dacy_large_trf-0.1.0
/venv/lib/python3.9/site-packages/spacy/util.py:833: UserWarning: [W095] Model 'da_dacy_medium_trf' (0.1.0) was trained with spaCy v3.1 and may not be 100% compatible with the current version (3.2.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
/venv/lib/python3.9/site-packages/spacy/util.py:833: UserWarning: [W095] Model 'da_dacy_small_trf' (0.1.0) was trained with spaCy v3.1 and may not be 100% compatible with the current version (3.2.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
/venv/lib/python3.9/site-packages/spacy_transformers/pipeline_component.py:406: UserWarning: Automatically converting a transformer component from spacy-transformers v1.0 to v1.1+. If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spacy-transformers version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
/venv/lib/python3.9/site-packages/torch/amp/autocast_mode.py:198: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
/venv/lib/python3.9/site-packages/spacy/pipeline/attributeruler.py:150: UserWarning: [W036] The component 'matcher' does not have any patterns defined.
matches = self.matcher(doc, allow_missing=True, as_spans=False)
hej
Notably, this includes three warnings, covering the spaCy version, the CUDA device, and the matcher object (see also #72).
The original version was sent to me by mail.
Note: while these warnings appear, DaCy still works as intended; the spaCy version does not influence model performance.
Moved from #133, originally posted by @EaLindhardt
I've tried to install DaCy through Anaconda, both with pip and conda install, using the different installation options described at: https://centre-for-humanities-computing.github.io/DaCy/installation.html
When running import dacy I get the following:
---------------------------------------------------------------------------
ContextualVersionConflict                Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 import dacy

File ~\AppData\Roaming\Python\Python39\site-packages\dacy\__init__.py:4, in <module>
      1 from dacy.hate_speech import make_offensive_transformer  # noqa
      2 from dacy.sentiment import make_emotion_transformer  # noqa
----> 4 from .about import download_url, title, version  # noqa
      5 from .download import download_model  # noqa
      6 from .load import load, models, where_is_my_dacy

File ~\AppData\Roaming\Python\Python39\site-packages\dacy\about.py:3, in <module>
      1 import pkg_resources
----> 3 version = pkg_resources.get_distribution("dacy").version
      4 title = "dacy"
      5 download_url = "https://github.com/centre-for-humanities-computing/DaCy"

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:477, in get_distribution(dist)
    475 dist = Requirement.parse(dist)
    476 if isinstance(dist, Requirement):
--> 477     dist = get_provider(dist)
    478 if not isinstance(dist, Distribution):
    479     raise TypeError("Expected string, Requirement, or Distribution", dist)

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:353, in get_provider(moduleOrReq)
    351 """Return an IResourceProvider for the named module or requirement"""
    352 if isinstance(moduleOrReq, Requirement):
--> 353     return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
    354 try:
    355     module = sys.modules[moduleOrReq]

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:897, in WorkingSet.require(self, *requirements)
    888 def require(self, *requirements):
    889     """Ensure that distributions matching requirements are activated
    890
    891     requirements must be a string or a (possibly-nested) sequence
    (...)
    895     included, even if they were already activated in this working set.
    896     """
--> 897     needed = self.resolve(parse_requirements(requirements))
    899     for dist in needed:
    900         self.add(dist)

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:788, in WorkingSet.resolve(self, requirements, env, installer, replace_conflicting, extras)
    785 if dist not in req:
    786     # Oops, the "best" so far conflicts with a dependency
    787     dependent_req = required_by[req]
--> 788     raise VersionConflict(dist, req).with_context(dependent_req)
    790 # push the new requirements onto the stack
    791 new_requirements = dist.requires(req.extras)[::-1]

ContextualVersionConflict: (spacy 3.3.1 (c:\users\au576018\anaconda3\lib\site-packages), Requirement.parse('spacy<3.3.0,>=3.2.0'), {'dacy'})
How do I solve this?
@EaLindhardt, will you please add the following information:
You can type python -m spacy info --markdown and copy-paste the result here, along with the DaCy version, which you can get using python -c "import dacy; print(dacy.__version__)".
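For reference, the ContextualVersionConflict above says that spacy 3.3.1 is installed while DaCy requires spacy<3.3.0,>=3.2.0, so one likely workaround (untested here) is to install spaCy within the required range:

python3 -m pip install "spacy>=3.2.0,<3.3.0"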
Currently, the training scripts do not train with Ray, which would make training much faster. However, enabling Ray seemingly makes the model not train at all.
Check whether the tokenizer maps å -> a. If that is the case, set "strip_accents": False when training models (it shouldn't be a big issue, as the NN should model this discrepancy). Also check how the tokenization deals with emojis. A quick check is sketched below.
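A minimal sketch of the check with the Hugging Face tokenizer; the checkpoint name is an assumption, use whichever model DaCy is trained on:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Maltehb/danish-bert-botxo")  # assumed checkpoint
print(tok.tokenize("blåbærgrød"))    # if "å" comes back as "a", accents are being stripped
print(tok.tokenize("Godt nytår 🎉"))  # check how emojis are tokenized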
@Guscode and @jacdals97 mentioned that they could extract the correct lemmas from the initial tagged data from SentiDa instead of using the stems. This will likely improve the current performance of DaVADER.
Move the repository to the Centre for Humanities Computing organization, to officially show that it is under CHC support.
Potentially using something like:
https://www.statestitle.com/resource/using-nlp-bert-to-improve-ocr-accuracy/
Another interesting read might be this blog post by Grammarly:
https://www.grammarly.com/blog/engineering/gec-tag-not-rewrite/
It might also be relevant to check out Grammarly's GECToR:
https://github.com/grammarly/gector
Potentially also check out:
https://github.com/PrithivirajDamodaran/Gramformer
Score is missing from the documentation, despite being well documented in the code.
The versions specified for some requirements are too strict. Specifically, I am trying to install DaCy together with dvc and got an error that the versions of tqdm are incompatible. For tqdm specifically, I don't think you need to specify a version at all, because it's pretty hard to break and it's a requirement far upstream. But as an example, I would pin from the smallest minor version that you need up to the next major version, which is how spacy does it. I would do this with pandas as well.
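For illustration, the pinning style described above could look like this in the requirements (version numbers hypothetical):

tqdm
pandas>=1.0.0,<2.0.0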
I'm doing a project where I compare different Danish embedding models using the sentence-transformers library. However, when I tried to load the model (from HuggingFace), I got the following error:
WARNING:root:No sentence-transformers model found with name C:\Users\jhr/.cache\torch\sentence_transformers\chcaa_da_dacy_small_trf. Creating a new one with MEAN pooling.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 89, in __init__
    modules = self._load_auto_model(model_path)
  File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 790, in _load_auto_model
    transformer_model = Transformer(model_name_or_path)
  File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\sentence_transformers\models\Transformer.py", line 28, in __init__
    config = AutoConfig.from_pretrained(model_name_or_path, **model_args, cache_dir=cache_dir)
  File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\transformers\models\auto\configuration_auto.py", line 527, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\transformers\configuration_utils.py", line 546, in get_config_dict
    resolved_config_file = cached_path(
  File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\transformers\file_utils.py", line 1420, in cached_path
    raise ValueError(f"unable to parse {url_or_filename} as a URL or as a local path")
ValueError: unable to parse C:\Users\jhr/.cache\torch\sentence_transformers\chcaa_da_dacy_small_trf\config.json as a URL or as a local path
The error persists both locally and in Google Colab. I can successfully load other models, such as Ælæctra, which is why it might be a DaCy-specific issue. If there's any way to fix this, please let me know :))
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("chcaa/da_dacy_small_trf")
When the new version of Asent is ready, replacing the DaVADER implementation in DaCy with it seems like a more stable solution.
Based on the evaluation by daLUKE, it might be relevant to add WikiANN and Plank as OOB datasets for NER.
DaCy contains a few readability measures. The package textdescriptives contains a much better implementation of readability measures. Thus, we recommend using textdescriptives instead.
@martincjespersen brought to my attention that there is currently no check of biases in Danish NLP models. It is probably possible to use the augmentation which is already in development and check performance using the spaCy scorer, as sketched after the issue link below.
augmentation issue:
#4
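A minimal sketch of the comparison, assuming a list of gold Example objects (e.g. from DaNE) and a hypothetical augment function from the augmentation work in #4:

import dacy
from spacy.training import Example  # examples below is a list[Example]

nlp = dacy.load("da_dacy_medium_trf-0.1.0")

def bias_check(nlp, examples, augment):
    # nlp.evaluate runs spaCy's built-in scorer and returns a dict of metrics
    scores_original = nlp.evaluate(examples)
    scores_augmented = nlp.evaluate([augment(eg) for eg in examples])
    # compare e.g. NER F-score before and after name augmentation
    return scores_original["ents_f"], scores_augmented["ents_f"]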
@martincjespersen notes that the SpaCy model performs slightly worse (but is still SOTA) when it is forced to use the original tokens of the DaNE corpus (for person NER). I assume this is also how performance is tested using SpaCy's evaluate script, so there seems to be a discrepancy between the two evaluate functions.
Testing the DaNLP evaluate script against the SpaCy scorer should resolve the issue.
Currently, the project YAML is fairly inefficient, with more copy-and-paste than should be necessary. It might be ideal to cut down on this by adding arguments to workflows and calling the same functions, assuming that is possible.
It might be relevant to improve the Danish tokenisation, e.g. the use of apostrophes in the following examples:
especially thanks to @martincjespersen for bringing this to my attention.
Currently, the person augmenter only deals with names; it could just as well deal with LOC, ORG, etc.
Similarly, it currently does two things: replacing a person with a new name and augmenting the name format. Split these up into two functions.
Currently, the model uses a lookup-based lemmatization on the training set. This can be improved by adapting the lemmy package for v3 of SpaCy.
Another potential solution might be to use the lemmatization LSTM from Stanza, which should be accessible using the spaCy integration. However, it might not perform as well out of distribution.
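A minimal sketch of getting Stanza lemmas through the spacy-stanza wrapper (assuming spacy-stanza 1.x):

import stanza
import spacy_stanza

stanza.download("da")  # one-time download of the Danish Stanza model
# wrap Stanza in a spaCy pipeline; lemmas come from Stanza's neural lemmatizer
nlp = spacy_stanza.load_pipeline("da")
doc = nlp("Hunden løb gennem parken.")
print([(t.text, t.lemma_) for t in doc])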
Update the testing workflow to work across multiple versions of Python, as done in:
https://github.com/KennethEnevoldsen/asent/blob/main/.github/workflows/pytest-cov-comment.yml