dacy's Introduction

DaCy: An efficient and unified framework for Danish NLP

DaCy is a Danish natural language preprocessing framework made with SpaCy. Its largest pipeline has achieved state-of-the-art performance on named entity recognition, part-of-speech tagging and dependency parsing for Danish. Feel free to try out the demo. This repository contains material for using DaCy, reproducing the results, and guides on the usage of the package. It also contains behavioral tests for biases and robustness of Danish NLP pipelines.

🔧 Installation

You can install dacy via pip from PyPI:

pip install dacy

πŸ‘©β€πŸ’» Usage

To use the model you first have to download either the small, medium, or large model. To see a list of all available models:

import dacy
for model in dacy.models():
    print(model)
# ...
# da_dacy_small_trf-0.2.0
# da_dacy_medium_trf-0.2.0
# da_dacy_large_trf-0.2.0

To download and load a model simply execute:

nlp = dacy.load("da_dacy_medium_trf-0.2.0")
# or equivalently (always loads the latest version)
nlp = dacy.load("medium")

To see more examples, see the documentation.
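The short name "medium" is described above as always resolving to the latest version. As a minimal sketch of how such a lookup could work, the following uses only model names that appear on this page. The helpers `parse_model_name` and `latest_model` are hypothetical and are not DaCy's own implementation.

```python
def parse_model_name(full_name):
    """Split 'da_dacy_medium_trf-0.2.0' into ('da_dacy_medium_trf', (0, 2, 0))."""
    name, _, version = full_name.rpartition("-")
    return name, tuple(int(part) for part in version.split("."))


def latest_model(size, available):
    """Return the newest model whose name contains the given size.

    Hypothetical helper for illustration; not part of the dacy package.
    """
    candidates = [m for m in available if f"_{size}_" in parse_model_name(m)[0]]
    if not candidates:
        raise ValueError(f"no model of size {size!r} found")
    # Compare versions as integer tuples so that 0.10.0 sorts above 0.2.0.
    return max(candidates, key=lambda m: parse_model_name(m)[1])


models = [
    "da_dacy_medium_trf-0.1.0",
    "da_dacy_small_trf-0.2.0",
    "da_dacy_medium_trf-0.2.0",
    "da_dacy_large_trf-0.2.0",
]
print(latest_model("medium", models))
```

Comparing versions as tuples of integers rather than as strings avoids ordering mistakes once a version component reaches two digits.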

📖 Documentation

Documentation
📚 Getting started: Guides and instructions on how to use DaCy and its features.
🦾 Performance: A detailed description of the performance of DaCy and a comparison with similar Danish models.
📰 News and changelog: New additions, changes and version history.
🎛 API References: The detailed reference for DaCy's API, including function documentation.
🙋 FAQ: Frequently asked questions.

Training and reproduction

The folder training contains a range of folders with a SpaCy project for each model version. This allows for the reproduction of the results.

Want to learn more about how DaCy initially came to be? Check out this blog post.


💬 Where to ask questions

To report issues or request features, please use the GitHub Issue Tracker. Questions related to SpaCy are kindly referred to the SpaCy GitHub or forum. Otherwise, please use the Discussion Forums.

Type
📚 FAQ: FAQ
🚨 Bug Reports: GitHub Issue Tracker
🎁 Feature Requests & Ideas: GitHub Issue Tracker
👩‍💻 Usage Questions: GitHub Discussions
🗯 General Discussion: GitHub Discussions

dacy's People

Contributors

actions-user, dependabot[bot], emiltj, github-actions[bot], hlasse, ines, julien-c, kasperfyhn, kennethenevoldsen, maltehb, martinbernstorff, pre-commit-ci[bot], sarakolding, smaakage85, sorenmulli

dacy's Issues

missing score

Score is missing from the documentation despite being well documented in code.

Change script to train with ray

Currently, the training scripts do not train with ray, which would make the training much faster. However, enabling ray seemingly prevents the model from training at all.

Make the project yml more efficient

Currently, the project yaml is fairly inefficient, with more copy and paste than should be necessary. It might be ideal to cut down on this by adding arguments to workflows and calling the same functions, assuming that is possible.

Import morphologizer

Currently, the model includes no morphologizer. This can easily be imported from the other Danish models.

Improve lemmatization

Currently, the model uses a lookup-based lemmatization based on the training set. This can be improved by adapting the lemmy package for v. 3 of SpaCy.

Another potential solution might be to use the lemmatization LSTM from stanza, which should be accessible using the spacy integration. However, it might not perform as well out of distribution.
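To make the limitation of a lookup-based lemmatizer concrete, here is a minimal sketch of one built from (token, lemma) pairs. The function names and the toy word pairs are illustrative assumptions and are not taken from DaCy or its training data.

```python
from collections import Counter, defaultdict


def build_lemma_lookup(pairs):
    """Map each lowercased token to the lemma it most often has in the pairs."""
    counts = defaultdict(Counter)
    for token, lemma in pairs:
        counts[token.lower()][lemma] += 1
    return {token: lemmas.most_common(1)[0][0] for token, lemmas in counts.items()}


def lemmatize(token, lookup):
    """Return the stored lemma, or the token itself if it was never seen."""
    return lookup.get(token.lower(), token)


# Toy (token, lemma) pairs for illustration only.
training_pairs = [
    ("hunde", "hund"),
    ("hunden", "hund"),
    ("spiser", "spise"),
    ("spiser", "spiser"),
    ("spiser", "spise"),
]
lookup = build_lemma_lookup(training_pairs)
print(lemmatize("Hunden", lookup))  # seen in training
print(lemmatize("katte", lookup))   # unseen, returned unchanged
```

Tokens that never occurred in the training pairs are returned unchanged, which is the gap that a rule-based or learned lemmatizer would be expected to close.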

ContextualVersionConflict Traceback (most recent call last)

Moved from #133, originally posted by @EaLindhardt

I've tried to download dacy through anaconda, both with pip and conda install and the different ways of installing: https://centre-for-humanities-computing.github.io/DaCy/installation.html

when running

import dacy

i get the following

`---------------------------------------------------------------------------
ContextualVersionConflict Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 import dacy

File ~\AppData\Roaming\Python\Python39\site-packages\dacy\__init__.py:4, in <module>
1 from dacy.hate_speech import make_offensive_transformer # noqa
2 from dacy.sentiment import make_emotion_transformer # noqa
----> 4 from .about import __download_url__, __title__, __version__ # noqa
5 from .download import download_model # noqa
6 from .load import load, models, where_is_my_dacy

File ~\AppData\Roaming\Python\Python39\site-packages\dacy\about.py:3, in <module>
1 import pkg_resources
----> 3 __version__ = pkg_resources.get_distribution("dacy").version
4 __title__ = "dacy"
5 __download_url__ = "https://github.com/centre-for-humanities-computing/DaCy"

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:477, in get_distribution(dist)
475 dist = Requirement.parse(dist)
476 if isinstance(dist, Requirement):
--> 477 dist = get_provider(dist)
478 if not isinstance(dist, Distribution):
479 raise TypeError("Expected string, Requirement, or Distribution", dist)

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:353, in get_provider(moduleOrReq)
351 """Return an IResourceProvider for the named module or requirement"""
352 if isinstance(moduleOrReq, Requirement):
--> 353 return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
354 try:
355 module = sys.modules[moduleOrReq]

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:897, in WorkingSet.require(self, *requirements)
888 def require(self, *requirements):
889 """Ensure that distributions matching requirements are activated
890
891 requirements must be a string or a (possibly-nested) sequence
(...)
895 included, even if they were already activated in this working set.
896 """
--> 897 needed = self.resolve(parse_requirements(requirements))
899 for dist in needed:
900 self.add(dist)

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:788, in WorkingSet.resolve(self, requirements, env, installer, replace_conflicting, extras)
785 if dist not in req:
786 # Oops, the "best" so far conflicts with a dependency
787 dependent_req = required_by[req]
--> 788 raise VersionConflict(dist, req).with_context(dependent_req)
790 # push the new requirements onto the stack
791 new_requirements = dist.requires(req.extras)[::-1]

ContextualVersionConflict: (spacy 3.3.1 (c:\users\au576018\anaconda3\lib\site-packages), Requirement.parse('spacy<3.3.0,>=3.2.0'), {'dacy'})`

How do I solve this?

@EaLindhardt will you please add the following information:

  • DaCy Version Used:
  • Operating System:
  • Python Version Used:
  • spaCy Version Used:
  • Environment Information:

You can also run python -m spacy info --markdown and copy-paste the result here along with the DaCy version, which you can get using python -c "import dacy; print(dacy.__version__)"
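The last line of the traceback says that the installed spaCy 3.3.1 falls outside the range 'spacy<3.3.0,>=3.2.0' required by dacy. The sketch below reproduces that comparison in a simplified form. It is a stand-in for what pkg_resources checks, handles only plain numeric versions, and the function names are illustrative.

```python
def parse_version(text):
    """Turn '3.3.1' into (3, 3, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies(installed, lower, upper):
    """True if lower <= installed < upper."""
    return parse_version(lower) <= parse_version(installed) < parse_version(upper)


# Values from the error message: spacy 3.3.1 against 'spacy<3.3.0,>=3.2.0'.
print(satisfies("3.3.1", lower="3.2.0", upper="3.3.0"))
# A version inside the range, for comparison.
print(satisfies("3.2.4", lower="3.2.0", upper="3.3.0"))
```

Under this reading, the conflict should disappear once the environment holds a spaCy version inside the required range, or a DaCy release whose requirement admits the installed spaCy.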

Address cuda warnings and spaCy version warning.

When running:

import dacy

for model in dacy.models():
    print(model)

dacy_nlp = dacy.load('medium')

doc = dacy_nlp("DaCy er en hurtig og effektiv pipeline til dansk sprogprocessering bygget i SpaCy.")

print('hej')

I get the following warning:


da_dacy_small_tft-0.0.0
da_dacy_medium_tft-0.0.0
da_dacy_large_tft-0.0.0
da_dacy_small_trf-0.1.0
da_dacy_medium_trf-0.1.0
da_dacy_large_trf-0.1.0
/venv/lib/python3.9/site-packages/spacy/util.py:833: UserWarning: [W095] Model 'da_dacy_medium_trf' (0.1.0) was trained with spaCy v3.1 and may not be 100% compatible with the current version (3.2.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
/venv/lib/python3.9/site-packages/spacy/util.py:833: UserWarning: [W095] Model 'da_dacy_small_trf' (0.1.0) was trained with spaCy v3.1 and may not be 100% compatible with the current version (3.2.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
/venv/lib/python3.9/site-packages/spacy_transformers/pipeline_component.py:406: UserWarning: Automatically converting a transformer component from spacy-transformers v1.0 to v1.1+. If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spacy-transformers version. For more details and available updates, run: python -m spacy validate
  warnings.warn(warn_msg)
/venv/lib/python3.9/site-packages/torch/amp/autocast_mode.py:198: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
/venv/lib/python3.9/site-packages/spacy/pipeline/attributeruler.py:150: UserWarning: [W036] The component 'matcher' does not have any patterns defined.
  matches = self.matcher(doc, allow_missing=True, as_spans=False)
hej

Notably, this includes three kinds of warnings: the SpaCy version, the cuda device, and the matcher object (see also #72).

The original version was sent to me by mail.

Note: While these are warnings, DaCy still works as intended. The version of spaCy does not influence model performance.
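Until the warnings are addressed in the package, a user can hide a specific one with Python's standard `warnings` module. The sketch below filters only messages that start with the code W095 and leaves the others visible. The demo warnings are raised by hand so that the example runs without DaCy installed, and the helper name is an assumption.

```python
import warnings


def ignore_spacy_version_warning():
    """Hide UserWarnings whose message starts with the code W095."""
    # The message argument is a regular expression matched at the start of the text.
    warnings.filterwarnings("ignore", message=r"\[W095\]", category=UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")   # record every warning by default
    ignore_spacy_version_warning()    # then hide only W095
    warnings.warn(
        "[W095] Model 'da_dacy_medium_trf' (0.1.0) was trained with spaCy v3.1",
        UserWarning,
    )
    warnings.warn(
        "[W036] The component 'matcher' does not have any patterns defined.",
        UserWarning,
    )

remaining = [str(w.message) for w in caught]
print(remaining)
```

In real use the filter would need to be installed before the model is loaded, since a filter only affects warnings raised after it is registered.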

Add CI for testing tutorials

  • Automatically run pytests on PR
  • Automatically run action for code coverage.
  • Automatically upload to PyPI (will be added in version 1)
  • Automatically run action for testing tutorials.

requirement versions too strict

Description

The versions specified for some requirements are too strict. Specifically, I am trying to install DaCy with dvc and got an error that the versions of tqdm are incompatible. For tqdm specifically, I don't think you need to specify a version because it's pretty hard to break it and it's a requirement way upstream. More generally, I would allow anything from the smallest minor version that you need up to the next major version, which is how spacy does it. I would do this with pandas as well.

Your Environment

  • DaCy Version Used: 1.2
  • Operating System: linux
  • Environment Information: installing with dvc (but this will happen quite a bit)
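The pinning style suggested in the description, from a minimum version up to the next major version, can be written down mechanically. The sketch below generates such a specifier. The helper `compatible_range` is hypothetical, and the minimum versions are placeholders rather than DaCy's actual requirements.

```python
def compatible_range(package, minimum):
    """Build 'package>=minimum,<next_major.0.0' from a minimum version string."""
    major = int(minimum.split(".")[0])
    return f"{package}>={minimum},<{major + 1}.0.0"


# Placeholder minimum versions, chosen only to show the output format.
for package, minimum in [("tqdm", "4.42.1"), ("pandas", "1.0.0")]:
    print(compatible_range(package, minimum))
```

A range like this lets a resolver pick any release within the same major version, which reduces conflicts with other packages that depend on the same library.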

Augmentation

  • Entity augmentation
    • Gender augmentation (awareness of gender)
    • Second order person augmentation (Lastname, Firstname)
    • Usernames (autogenerates e.g. WhiteTruffle101 or Kenneth Enevoldsen -> KennethEnevoldsen)
  • Misspelling augmentations, see e.g. this repo
    • Keystroke error based on keyboard distance
  • Historic augmentations
    • æ -> ae, å -> aa (and a), ø -> oe
    • uppercasing of nouns
  • Social media
    • Adding hashtags augmentation
  • Others, potentially see this tweet or this kaggle summary
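Several of the augmentations listed above are simple string transformations. The sketch below shows the historic spelling replacement, the "Lastname, Firstname" reordering, the username form, and a keystroke error drawn from neighbouring keys. All function names are illustrative, the capitalised replacements are an assumption, and the neighbour map covers only a few QWERTY keys.

```python
import random

# Historic spellings from the list above; capitalised variants are an assumption.
HISTORIC = {"æ": "ae", "å": "aa", "ø": "oe", "Æ": "Ae", "Å": "Aa", "Ø": "Oe"}

# A tiny neighbour map for a QWERTY layout; only a few keys are covered.
NEIGHBOURS = {"a": "qwsz", "e": "wrsd", "n": "bhjm", "t": "rfgy"}


def historic_spelling(text):
    """Replace æ, å and ø with their older digraph spellings."""
    return "".join(HISTORIC.get(ch, ch) for ch in text)


def lastname_firstname(full_name):
    """'Kenneth Enevoldsen' -> 'Enevoldsen, Kenneth'."""
    *first, last = full_name.split()
    return f"{last}, {' '.join(first)}" if first else last


def username(full_name):
    """'Kenneth Enevoldsen' -> 'KennethEnevoldsen'."""
    return "".join(full_name.split())


def keystroke_error(text, rate, rng):
    """Swap covered characters for a neighbouring key with probability `rate`."""
    out = []
    for ch in text:
        if ch in NEIGHBOURS and rng.random() < rate:
            out.append(rng.choice(NEIGHBOURS[ch]))
        else:
            out.append(ch)
    return "".join(out)


print(historic_spelling("blåbærgrød"))
print(lastname_firstname("Kenneth Enevoldsen"))
print(username("Kenneth Enevoldsen"))
print(keystroke_error("en test", rate=0.5, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the keystroke augmentation reproducible, which matters when the augmented data is used to compare models.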

Cannot load as SentenceTransformer

I'm doing a project where I compare different Danish embedding models using the sentence-transformers library. However, when I tried to load the model (from HuggingFace) I get the following error:

WARNING:root:No sentence-transformers model found with name C:\Users\jhr/.cache\torch\sentence_transformers\chcaa_da_dacy_small_trf. Creating a new one with MEAN pooling.
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 89, in __init__
modules = self._load_auto_model(model_path)
File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 790, in _load_auto_model
transformer_model = Transformer(model_name_or_path)
File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\sentence_transformers\models\Transformer.py", line 28, in __init__
config = AutoConfig.from_pretrained(model_name_or_path, **model_args, cache_dir=cache_dir)
File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\transformers\models\auto\configuration_auto.py", line 527, in from_pretrained
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\transformers\configuration_utils.py", line 546, in get_config_dict
resolved_config_file = cached_path(
File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\transformers\file_utils.py", line 1420, in cached_path
raise ValueError(f"unable to parse {url_or_filename} as a URL or as a local path")
ValueError: unable to parse C:\Users\jhr/.cache\torch\sentence_transformers\chcaa_da_dacy_small_trf\config.json as a URL or as a local path

The error persists both locally and in Google Colab. I can successfully load other models such as Ælæctra, which is why it might be a DaCy specific issue. If there's any way to fix this, please let me know :))

How to reproduce the behaviour

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("chcaa/da_dacy_small_trf")

Your Environment

  • DaCy Version Used: NA (only interacted through sentence-transformers 2.1.0)
  • Operating System: Windows 10
  • Python Version Used: 3.8.11
  • spaCy Version Used: NA

bias detection using augmentation

@martincjespersen brought to my attention that there is currently no check of biases in Danish NLP models. It is probably possible to use the augmentation which is already in development and check performance using the spacy scorer.

augmentation issue:
#4

removing readability

DaCy contains a few readability measures. The package textdescriptives contains a much better implementation of readability measures. Thus we recommend using

Performance differences between DaNLP and SpaCy benchmark

@martincjespersen notes that the SpaCy model performs slightly worse (but still SOTA) when it is forced to use the original tokens of the DaNE corpus (for person NER). I assume this is how performance is tested using SpaCy's evaluate script as well, so there seems to be a discrepancy between the two evaluate functions.

Testing DaNLP evaluate script against SpaCy scorer should resolve the issue.

Pos tags

Currently, the POS tag resides in .tag_ instead of .pos_, where it should (also) be.

replace davader with asent

When the new version of asent is ready, replacing the davader implementation in DaCy with it seems like a more stable solution.

Download of large transformer model 0.1.0 does not work

How to reproduce the behaviour

>>> nlp = dacy.load("da_dacy_large_trf-0.1.0")
Defaulting to user installation because normal site-packages is not writeable
Collecting da-dacy-large-trf==any
  ERROR: HTTP error 404 while getting https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl
ERROR: Could not install requirement da-dacy-large-trf==any from https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl because of HTTP error 404 Client Error: Not Found for url: https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl for URL https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl

Your Environment

  • DaCy Version Used: 1.1.3
  • Operating System: Pop!_OS 20.04 LTS
  • Python Version Used: 3.8.10
  • spaCy Version Used: 3.1.2
  • Environment Information:

Note

I note that the subdirectory part "da_dacy_medium_trf" does not correspond to da_dacy_large_trf-any-py3-none-any.whl

Changing the subdirectory in the URL:

python3 -m pip install https://huggingface.co/chcaa/da_dacy_large_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl 

seems to work:

python3
>>> import dacy
>>> nlp = dacy.load("da_dacy_large_trf-0.1.0")
>>> doc = nlp('Hej verden.')
/usr/local/lib/python3.8/dist-packages/spacy/pipeline/attributeruler.py:108: UserWarning: [W036] The component 'matcher' does not have any patterns defined.
  matches = self.matcher(doc, allow_missing=True)
>>> doc._.trf_data.tensors[-1].shape
(1, 1024)
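The 404 above comes from a URL whose repository part (da_dacy_medium_trf) and file name (da_dacy_large_trf) disagree. One way to rule out that class of mismatch is to derive both parts from a single model name, as in this sketch. The helper `wheel_url` is hypothetical and follows the URL pattern of the working command above. It is not DaCy's download code.

```python
BASE_URL = "https://huggingface.co/chcaa"


def wheel_url(model_name):
    """Build the wheel URL so the repository and file name share one model name."""
    return f"{BASE_URL}/{model_name}/resolve/main/{model_name}-any-py3-none-any.whl"


print(wheel_url("da_dacy_large_trf"))
```

Because the model name is interpolated in both places, the repository and the wheel file cannot drift apart.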

Check potential issue with tokenization

Check whether å -> a during tokenization.

If that is the case, set:
"strip_accents": False

when training models (it shouldn't be a big issue, as the NN should model this discrepancy).

Also check how the tokenization deals with emojis.
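The effect of accent stripping on Danish letters can be checked without loading a tokenizer. The sketch below assumes the tokenizer strips accents by Unicode decomposition (NFD) followed by removal of combining marks, which is how BERT-style basic tokenizers commonly do it. The function name is illustrative.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


for ch in "åæø":
    print(ch, "->", strip_accents(ch))
```

Under this assumption only å is affected, because it decomposes into a plus a combining ring, while æ and ø have no canonical decomposition and pass through unchanged.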

New language models to fine-tune

  • ConvBERT small
  • Ælæctra Cased
  • Ælæctra uncased
  • ELECTRA
  • ConvBERT medium
    • Won't be trained, as the small ConvBERT did not compete favorably against the ELECTRA models.
  • Multilingual distilBERT
  • mDeBERTa v3
  • nb-bert-large
  • nb-bert-base

User configurable cache dir

I would like to configure the cache dir with an environment variable instead of using the default ~/.dacy. This is a relatively simple change, but it is useful if you want the cache to be in a known place (e.g. a specific folder in a docker image). I created a PR for this feature (#67). Some projects like to have associated issues, so I am creating an issue as well.
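A minimal sketch of the requested behaviour is shown below: read the cache location from an environment variable and fall back to ~/.dacy. The variable name `DACY_CACHE_DIR` and the function `cache_dir` are assumptions made for illustration and may differ from what the PR implements.

```python
import os
from pathlib import Path


def cache_dir(env_var="DACY_CACHE_DIR"):
    """Return the directory named by the environment variable, else ~/.dacy."""
    override = os.environ.get(env_var)
    return Path(override) if override else Path.home() / ".dacy"


print(cache_dir())
```

In a Docker image, the variable could then be set once in the image definition so that models are always downloaded to the same known folder.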
