centre-for-humanities-computing / dacy

DaCy: The State of the Art Danish NLP pipeline using SpaCy

Home Page: https://centre-for-humanities-computing.github.io/DaCy/

License: Apache License 2.0

Python 98.84% Shell 0.47% Makefile 0.70%

spacy reproducible-workflows danish-language natural-language-processing

dacy's Introduction

DaCy: An efficient and unified framework for Danish NLP

DaCy is a Danish natural language processing framework built with SpaCy. Its largest pipeline has achieved state-of-the-art performance on named entity recognition, part-of-speech tagging and dependency parsing for Danish. Feel free to try out the demo. This repository contains material for using DaCy and reproducing the results, as well as guides on how to use the package. It also contains behavioral tests for biases and robustness of Danish NLP pipelines.

🔧 Installation

You can install dacy via pip from PyPI:

pip install dacy

👩‍💻 Usage

To use the model you first have to download either the small, medium, or large model. To see a list of all available models:

import dacy
for model in dacy.models():
    print(model)
# ...
# da_dacy_small_trf-0.2.0
# da_dacy_medium_trf-0.2.0
# da_dacy_large_trf-0.2.0

To download and load a model simply execute:

nlp = dacy.load("da_dacy_medium_trf-0.2.0")
# or equivalently (always loads the latest version)
nlp = dacy.load("medium")
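
As a rough illustration of what "always loads the latest version" could mean, the sketch below picks the highest-versioned entry for a given size from a list of model names in the format shown above. The helpers `split_model_name` and `resolve_latest` are hypothetical and not part of DaCy's API, and the hard-coded list only stands in for the output of `dacy.models()`.

```python
# Hypothetical helpers, NOT DaCy's implementation: they only illustrate how a
# short name such as "medium" could map to the newest matching model.
AVAILABLE = [
    "da_dacy_small_trf-0.2.0",
    "da_dacy_medium_trf-0.1.0",
    "da_dacy_medium_trf-0.2.0",
    "da_dacy_large_trf-0.2.0",
]


def split_model_name(full_name):
    """Split 'da_dacy_medium_trf-0.2.0' into ('da_dacy_medium_trf', (0, 2, 0))."""
    name, _, version = full_name.rpartition("-")
    return name, tuple(int(part) for part in version.split("."))


def resolve_latest(size, available=AVAILABLE):
    """Return the highest-versioned model whose name contains the given size."""
    candidates = [m for m in available if f"_{size}_" in split_model_name(m)[0]]
    if not candidates:
        raise ValueError(f"no model found for size {size!r}")
    # Compare versions as integer tuples so that 0.10.0 sorts after 0.2.0.
    return max(candidates, key=lambda m: split_model_name(m)[1])


print(resolve_latest("medium"))  # da_dacy_medium_trf-0.2.0
```

Pinning the full name (including the version) remains the safer choice when results need to be reproducible.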

For more examples, see the documentation.

📖 Documentation

Documentation
📚 Getting started	Guides and instructions on how to use DaCy and its features.
🦾 Performance	A detailed description of the performance of DaCy and comparison with similar Danish models
📰 News and changelog	New additions, changes and version history.
🎛 API References	The detailed reference for DaCy's API, including function documentation.
🙋 FAQ	Frequently asked questions

Training and reproduction

The folder training contains a range of folders with a SpaCy project for each model version. This allows for the reproduction of the results.

To learn more about how DaCy initially came to be, check out this blog post.

💬 Where to ask questions

To report issues or request features, please use the GitHub Issue Tracker. Questions related to SpaCy are kindly referred to the SpaCy GitHub or forum. Otherwise, please use the Discussion Forums.

Type
📚 FAQ	FAQ
🚨 Bug Reports	GitHub Issue Tracker
🎁 Feature Requests & Ideas	GitHub Issue Tracker
👩‍💻 Usage Questions	GitHub Discussions
🗯 General Discussion	GitHub Discussions

dacy's Issues

Change script to train with ray

Currently, the training scripts do not train with ray, which would make training much faster. However, enabling ray seemingly prevents the model from training at all.

Remove matcher from pipeline to avoid raised warning.

Add discussion board

Add a community discussion board.

Add Tutorials: "Extracting text statistics and readability metrics using DaCy and Textdescriptives"

After removing readability, it would be nice to have a tutorial on: "Extracting text statistics and readability metrics using DaCy and Textdescriptives"

Potentially using the packages to examine the difference in language complexity between conversational data and legal documents in DAGW, or a similar task using a publicly available dataset.

Pos tags

Currently, the POS tag resides in .tag_ instead of .pos_, where it should (also) be.

removing readability

DaCy contains a few readability measures. The package textdescriptives contains a much better implementation of readability measures. Thus we recommend using

Download of large transformer model 0.1.0 does not work

How to reproduce the behaviour

>>> nlp = dacy.load("da_dacy_large_trf-0.1.0")
Defaulting to user installation because normal site-packages is not writeable
Collecting da-dacy-large-trf==any
  ERROR: HTTP error 404 while getting https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl
ERROR: Could not install requirement da-dacy-large-trf==any from https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl because of HTTP error 404 Client Error: Not Found for url: https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl for URL https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl

Your Environment

DaCy Version Used: 1.1.3
Operating System: Pop!_OS 20.04 LTS
Python Version Used: 3.8.10
spaCy Version Used: 3.1.2
Environment Information:

Note

I note that the subdirectory part "da_dacy_medium_trf" does not correspond to da_dacy_large_trf-any-py3-none-any.whl

Changing the subdirectory in the URL:

python3 -m pip install https://huggingface.co/chcaa/da_dacy_large_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl

seems to work:

python3
>>> import dacy
>>> nlp = dacy.load("da_dacy_large_trf-0.1.0")
>>> doc = nlp('Hej verden.')
/usr/local/lib/python3.8/dist-packages/spacy/pipeline/attributeruler.py:108: UserWarning: [W036] The component 'matcher' does not have any patterns defined.
  matches = self.matcher(doc, allow_missing=True)
>>> doc._.trf_data.tensors[-1].shape
(1, 1024)
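
The mismatch noted above suggests that the repository part and the wheel part of the URL should come from the same model name. A minimal sketch of that idea follows; `wheel_url` is a hypothetical helper written for this note, not DaCy's actual download code, and the URL pattern is copied from the working command above.

```python
BASE = "https://huggingface.co/chcaa"


def wheel_url(model):
    """Build the wheel URL so the repository and the file share one model name."""
    # Drop the version suffix, e.g. "da_dacy_large_trf-0.1.0" -> "da_dacy_large_trf".
    name = model.rsplit("-", 1)[0]
    return f"{BASE}/{name}/resolve/main/{name}-any-py3-none-any.whl"


print(wheel_url("da_dacy_large_trf-0.1.0"))
```

Deriving both path segments from a single variable makes it impossible for "medium" to end up in the repository part of a "large" download.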

Add emoji classification to DaCy

This is going to be great -> 🎉
(might be doable with simply MLM)

replace augmenter with augmenty

The new spaCy project for augmenters, augmenty, is a much more suitable place for the augmenter than DaCy. The augmenter will thus be moved there instead.

Check potential issue with tokenization

Check if
å -> a

if that is the case

set:
"strip_accents": False

when training models (shouldn't be a big issue as the NN should model this discrepancy)

also check how the tokenization deals with emojis.
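
To make the concern concrete, the sketch below reproduces the usual accent-stripping step (Unicode NFD decomposition followed by dropping combining marks). It assumes the tokenizer strips accents in this common way and does not call any tokenizer library.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after NFD decomposition (a common accent-stripping step)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers nonspacing combining marks, such as the ring in "å".
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("å"))    # a
print(strip_accents("æøå"))  # æøa
```

Under this assumption only å is affected: it decomposes into "a" plus a combining ring, whereas æ and ø have no decomposition and pass through unchanged.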

Add website with documentation

Using Sphinx and potentially using:
https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring

Import morphologizer

Currently, the model includes no morphologizer. This can easily be imported from the other Danish models.

ContextualVersionConflict Traceback (most recent call last)

Moved from #133, originally posted by @EaLindhardt

I've tried to download dacy through anaconda, both with pip and conda install and the different ways of installing: https://centre-for-humanities-computing.github.io/DaCy/installation.html

when running

import dacy

I get the following:

---------------------------------------------------------------------------
ContextualVersionConflict Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 import dacy

File ~\AppData\Roaming\Python\Python39\site-packages\dacy\__init__.py:4, in <module>
1 from dacy.hate_speech import make_offensive_transformer # noqa
2 from dacy.sentiment import make_emotion_transformer # noqa
----> 4 from .about import __download_url__, __title__, __version__ # noqa
5 from .download import download_model # noqa
6 from .load import load, models, where_is_my_dacy

File ~\AppData\Roaming\Python\Python39\site-packages\dacy\about.py:3, in <module>
1 import pkg_resources
----> 3 __version__ = pkg_resources.get_distribution("dacy").version
4 __title__ = "dacy"
5 __download_url__ = "https://github.com/centre-for-humanities-computing/DaCy"

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:477, in get_distribution(dist)
475 dist = Requirement.parse(dist)
476 if isinstance(dist, Requirement):
--> 477 dist = get_provider(dist)
478 if not isinstance(dist, Distribution):
479 raise TypeError("Expected string, Requirement, or Distribution", dist)

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:353, in get_provider(moduleOrReq)
351 """Return an IResourceProvider for the named module or requirement"""
352 if isinstance(moduleOrReq, Requirement):
--> 353 return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
354 try:
355 module = sys.modules[moduleOrReq]

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:897, in WorkingSet.require(self, *requirements)
888 def require(self, *requirements):
889 """Ensure that distributions matching requirements are activated
890
891 requirements must be a string or a (possibly-nested) sequence
(...)
895 included, even if they were already activated in this working set.
896 """
--> 897 needed = self.resolve(parse_requirements(requirements))
899 for dist in needed:
900 self.add(dist)

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:788, in WorkingSet.resolve(self, requirements, env, installer, replace_conflicting, extras)
785 if dist not in req:
786 # Oops, the "best" so far conflicts with a dependency
787 dependent_req = required_by[req]
--> 788 raise VersionConflict(dist, req).with_context(dependent_req)
790 # push the new requirements onto the stack
791 new_requirements = dist.requires(req.extras)[::-1]

ContextualVersionConflict: (spacy 3.3.1 (c:\users\au576018\anaconda3\lib\site-packages), Requirement.parse('spacy<3.3.0,>=3.2.0'), {'dacy'})

How do I solve this?

@EaLindhardt will you please add the following information:

DaCy Version Used:
Operating System:
Python Version Used:
spaCy Version Used:
Environment Information:

You can also run python -m spacy info --markdown and copy-paste the result here along with the DaCy version, which you can get using python -c "import dacy; print(dacy.__version__)"
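
The last line of the traceback states the conflict directly: spaCy 3.3.1 is installed, while the installed DaCy requires spacy<3.3.0,>=3.2.0. The sketch below checks that range with plain tuple comparison; it is only an illustration (it handles simple numeric versions such as "3.3.1", not pre-releases), and `parse_version` and `satisfies` are hypothetical helpers.

```python
def parse_version(version):
    """Turn '3.3.1' into (3, 3, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(installed, minimum, below):
    """Return True if minimum <= installed < below."""
    return parse_version(minimum) <= parse_version(installed) < parse_version(below)


# Range taken from the error message: spacy<3.3.0,>=3.2.0
print(satisfies("3.3.1", "3.2.0", "3.3.0"))  # False: the reported conflict
print(satisfies("3.2.4", "3.2.0", "3.3.0"))  # True
```

Installing a spaCy release inside the required range, or a DaCy release whose requirement admits the installed spaCy, should presumably clear the error.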

DaVADER: use lemmas instead of stems

@Guscode and @jacdals97 mentioned that they could extract the correct lemmas from the initial tagged data from SentiDa instead of using the stems. This will likely improve the current performance of DaVADER.

Add Danish synonym list and common misspellings.

Add twitter extension for extracting hashtags and mentions from a doc

and URLs
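
A minimal sketch of what such an extraction could look like, using only regular expressions on the raw text. The patterns are deliberately simple assumptions rather than a full specification of Twitter's rules, and `extract_twitter_fields` is a hypothetical name; in spaCy the result could be exposed through a custom `Doc` extension.

```python
import re

# Simplified patterns (assumptions): a tag or handle is "#" or "@" followed by
# word characters, and a URL runs from "http(s)://" to the next whitespace.
HASHTAG = re.compile(r"(?<!\w)#(\w+)")
MENTION = re.compile(r"(?<!\w)@(\w+)")
URL = re.compile(r"https?://\S+")


def extract_twitter_fields(text):
    """Return the hashtags, mentions and URLs found in a text."""
    return {
        "hashtags": HASHTAG.findall(text),
        "mentions": MENTION.findall(text),
        "urls": URL.findall(text),
    }


tweet = "Tak til @eksempel for #DaCy og #dansk NLP, se https://example.org/dacy"
print(extract_twitter_fields(tweet))
```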

Make the project yml more efficient

Currently, the project YAML is fairly inefficient, with more copy and paste than should be necessary. It might be ideal to cut down on this by adding arguments to workflows and calling the same functions, assuming that is possible.

Add resource: Danish word frequencies DAGW

Add word frequencies estimated from Danish Gigaword as a language resource.

New language models to fine-tune

Add CI for testing tutorials

Automatically run pytests on PR
Automatically run action for code coverage.
Automatically upload to PyPI (will be added in version 1)
Automatically run action for testing tutorials.

Upload names.csv to huggingface model hub

Morphologizer

Currently, the model includes no morphologizer

Installing DaCy downgrades huggingface-hub to 0.0.12, leading to incompatible load_dataset

How to reproduce the behaviour

Install DaCy
from datasets import load_dataset

Gives me an error :-)

improve tokenisation for Danish

It might be relevant to improve the Danish tokenisation. E.g. the use of apostrophes in the following examples:

Special thanks to @martincjespersen for bringing this to my attention.

Add Coreference resolution model

Add the DaNLP and Twitter coref. models.

Improve lemmatization

Currently, the model uses a lookup-based lemmatization built on the training set. This can be improved by adapting the lemmy package for v3 of SpaCy.

Another potential solution might be to use the lemmatization LSTM from stanza, which should be accessible using the spaCy integration. However, it might not perform as well out of distribution.

generalise and optimise person augmenter to deal with all entities

Currently, the person augmenter only deals with names; it could just as well deal with LOC, ORG, etc.

Similarly, it currently does two things: it replaces a person with a new name and it augments the name format. Split these into two functions.

Examine discrepancy between spaCy dependency parse and DaCy dependency parse

Examine a potential discrepancy between spaCy dependency parse and DaCy dependency parse noted by @rdkm89.

Update the classification transformer scripts to work with spacy-transformers 1.1.0 and above

spacy-transformers changed the API when going from 1.0.x to 1.1.x and the classification wrapper will need to be changed accordingly.

replace davader with asent

When the new version of asent is ready, replacing the davader implementation in DaCy with it seems like a more stable solution.

Add LIT for the Danish Benchmark which the models are applied to.

LIT is a great tool for exploring model predictions. I would love to add a tutorial on using LIT with DaCy.

https://pair-code.github.io/lit/setup/

sunglasses wrapping a fine-tuned Tranformer

The title "sunglasses wrapping a fine-tuned Tranformer" in the tutorials section is missing an s.

Which page or section is this issue related to?

User configurable cache dir

I would like to configure the cache dir with an environment variable instead of using the default ~/.dacy. This is a relatively simple change, but it is useful if you want the cache to be in a known place (e.g. a specific folder in a Docker image). I created a PR for this feature (#67). Some projects like to have associated issues, so I am creating an issue as well.
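
A minimal sketch of the requested behaviour: read an environment variable and fall back to ~/.dacy when it is not set. The variable name `DACY_CACHE_DIR` and the function `get_cache_dir` are assumptions made for illustration; the names used in PR #67 may differ.

```python
import os
from pathlib import Path

# Default location mentioned in the issue.
DEFAULT_CACHE_DIR = Path.home() / ".dacy"


def get_cache_dir(env_var="DACY_CACHE_DIR"):
    """Return the cache directory, preferring the environment variable if set.

    "DACY_CACHE_DIR" is an assumed name, chosen only for this example.
    """
    override = os.environ.get(env_var)
    return Path(override).expanduser() if override else DEFAULT_CACHE_DIR


print(get_cache_dir())
```

In a Docker image this would allow something like ENV DACY_CACHE_DIR=/opt/models/dacy so that the downloaded models end up in a known layer.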

Update testing workflow to work across multiple versions of python

Update the testing workflow to work across multiple versions of Python, as done in:
https://github.com/KennethEnevoldsen/asent/blob/main/.github/workflows/pytest-cov-comment.yml

No tests for tutorials

Currently, there are no CI tests of the tutorials.

adding wikiANN and plank

Based on the evaluation by daLUKE, it might be relevant to add wikiANN and plank as OOB datasets for NER.

Address cuda warnings and spaCy version warning.

When running:

import dacy

for model in dacy.models():
    print(model)

dacy_nlp = dacy.load('medium')

doc = dacy_nlp("DaCy er en hurtig og effektiv pipeline til dansk sprogprocessering bygget i SpaCy.")

print('hej')

I get the following warning:


da_dacy_small_tft-0.0.0
da_dacy_medium_tft-0.0.0
da_dacy_large_tft-0.0.0
da_dacy_small_trf-0.1.0
da_dacy_medium_trf-0.1.0
da_dacy_large_trf-0.1.0
/venv/lib/python3.9/site-packages/spacy/util.py:833: UserWarning: [W095] Model 'da_dacy_medium_trf' (0.1.0) was trained with spaCy v3.1 and may not be 100% compatible with the current version (3.2.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
/venv/lib/python3.9/site-packages/spacy/util.py:833: UserWarning: [W095] Model 'da_dacy_small_trf' (0.1.0) was trained with spaCy v3.1 and may not be 100% compatible with the current version (3.2.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
/venv/lib/python3.9/site-packages/spacy_transformers/pipeline_component.py:406: UserWarning: Automatically converting a transformer component from spacy-transformers v1.0 to v1.1+. If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spacy-transformers version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
/venv/lib/python3.9/site-packages/torch/amp/autocast_mode.py:198: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
/venv/lib/python3.9/site-packages/spacy/pipeline/attributeruler.py:150: UserWarning: [W036] The component 'matcher' does not have any patterns defined.
  matches = self.matcher(doc, allow_missing=True, as_spans=False)
hej

Notably, this includes three warnings, concerning the SpaCy version, the cuda device and the matcher object (see also #72).

The original version was sent to me by mail.

Note: While these are warnings, DaCy still works as intended. The version of spaCy does not influence model performance.
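
Until the warnings are addressed in DaCy itself, a user who prefers a quiet log could filter them with the standard library. This is a workaround sketch, not a fix: the patterns are copied from the messages quoted above, `silence_known_warnings` is a hypothetical helper, and silencing a warning does not change what triggered it.

```python
import warnings


def silence_known_warnings():
    """Ignore the UserWarnings quoted above by matching the start of each message."""
    patterns = (
        r"\[W095\]",                             # spaCy model/version mismatch
        r"\[W036\]",                             # matcher without patterns
        r"User provided device_type of 'cuda'",  # torch autocast without CUDA
    )
    for pattern in patterns:
        # `message` is a regular expression matched against the start of the text.
        warnings.filterwarnings("ignore", message=pattern, category=UserWarning)


silence_known_warnings()
```

The function would be called before dacy.load(...), so that the filters are in place when the pipeline is loaded and run.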

Wrap DaNLP sentiment in Thinc

missing score

Score is missing from the documentation despite being well documented in code.

Cannot load as SentenceTransformer

I'm doing a project where I compare different Danish embedding models using the sentence-transformers library. However, when I tried to load the model (from HuggingFace) I get the following error:

WARNING:root:No sentence-transformers model found with name C:\Users\jhr/.cache\torch\sentence_transformers\chcaa_da_dacy_small_trf. Creating a new one with MEAN pooling.
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 89, in init
modules = self._load_auto_model(model_path)
File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 790, in _load_auto_model
transformer_model = Transformer(model_name_or_path)
File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\sentence_transformers\models\Transformer.py", line 28, in init
config = AutoConfig.from_pretrained(model_name_or_path, **model_args, cache_dir=cache_dir)
File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\transformers\models\auto\configuration_auto.py", line 527, in from_pretrained
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\transformers\configuration_utils.py", line 546, in get_config_dict
resolved_config_file = cached_path(
File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\transformers\file_utils.py", line 1420, in cached_path
raise ValueError(f"unable to parse {url_or_filename} as a URL or as a local path")
ValueError: unable to parse C:\Users\jhr/.cache\torch\sentence_transformers\chcaa_da_dacy_small_trf\config.json as a URL or as a local path

The error persists both locally and in Google Colab. I can successfully load other models such as Ælæctra, which is why it might be a DaCy specific issue. If there's any way to fix this, please let me know :))

How to reproduce the behaviour

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("chcaa/da_dacy_small_trf")

Your Environment

DaCy Version Used: NA (only interacted through sentence-transformers 2.1.0)
Operating System: Windows 10
Python Version Used: 3.8.11
spaCy Version Used: NA

Move to Center for Humanities Computing

Move to the Center for Humanities Computing organization, to officially show that it is under CHC support.

Augmentation

Performance differences between DaNLP and SpaCy benchmark

@martincjespersen notes that the SpaCy model performs slightly worse (but still SOTA) when it is forced to use the original tokens of the DaNE corpus (for person NER). I assume this is how performance is tested using SpaCy's evaluate script as well, so there seems to be a discrepancy between the two evaluate functions.

Testing DaNLP evaluate script against SpaCy scorer should resolve the issue.

Add sota NER model by Dan Nielsen

It should be possible using spacy-wrap 1.2.0.

Add a spelling correction module

Potentially using something like:
https://www.statestitle.com/resource/using-nlp-bert-to-improve-ocr-accuracy/

Another interesting read might be this blogpost by grammarly:
https://www.grammarly.com/blog/engineering/gec-tag-not-rewrite/

Here it might also be relevant to check out Grammarly's gector:
https://github.com/grammarly/gector

Potentially also check out:
https://github.com/PrithivirajDamodaran/Gramformer

Add table with model parameters, layers etc. in the training markdown

add force_redownload to dacy.load()

bias detection using augmentation

@martincjespersen brought to my attention that there is currently no check for biases in Danish NLP models. It is probably possible to use the augmentation, which is already in development, and check performance using the spaCy scorer.

augmentation issue:
#4

Add UDPipe to comparisons

For instance using the spacy implementation: https://github.com/TakeLab/spacy-udpipe

requirement versions too strict

Description

The versions specified for some requirements are too strict. Specifically, I am trying to install DaCy with dvc and got an error that the versions of tqdm are incompatible. For tqdm specifically, I don't think you need to specify a version because it's pretty hard to break and it's a requirement way upstream. But as an example, I would allow everything from the smallest minor version you need up to the next major version, which is how spaCy does it. I would do this with pandas as well.
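
The suggested convention can be written down as a small rule: lower bound at the smallest version you need, upper bound at the next major version. The sketch below only builds the requirement string; `compatible_range` is a hypothetical helper and the version numbers are illustrative, not the ones DaCy actually needs.

```python
def compatible_range(package, minimum):
    """Return a requirement from `minimum` up to, but excluding, the next major version."""
    major = int(minimum.split(".")[0])
    return f"{package}>={minimum},<{major + 1}.0.0"


# Illustrative minimum versions (assumptions, not DaCy's real requirements).
print(compatible_range("tqdm", "4.62.0"))   # tqdm>=4.62.0,<5.0.0
print(compatible_range("pandas", "1.3.0"))  # pandas>=1.3.0,<2.0.0
```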

Your Environment

DaCy Version Used: 1.2
Operating System: linux
Environment Information: installing with dvc (but this will happen quite a bit)
