centre-for-humanities-computing / dacy

DaCy: The State of the Art Danish NLP pipeline using SpaCy
Home Page: https://centre-for-humanities-computing.github.io/DaCy/
License: Apache License 2.0
Currently, there are no CI tests for the tutorials.
The tutorial "😎 Wrapping a fine-tuned Tranformer" in the tutorials section is missing an "s" ("Tranformer" should be "Transformer").
Currently, the POS tag resides in the .tag_ attribute instead of .pos_, where it should (also) be.
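A minimal sketch of the current behaviour (the model name is taken from the load calls elsewhere in these issues):

import dacy

nlp = dacy.load("da_dacy_medium_trf-0.1.0")
doc = nlp("Hej verden.")
for token in doc:
    # the POS tag currently only shows up in .tag_; .pos_ stays empty
    print(token.text, token.tag_, token.pos_)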
Use Sphinx for the documentation, potentially together with:
https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring
Running from datasets import load_dataset gives me an error :-)
Currently, the model includes no morphologizer. This can easily be imported from one of the other Danish models.
For instance, using the spaCy implementation: https://github.com/TakeLab/spacy-udpipe
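A minimal sketch of one option, sourcing the trained component from an existing Danish spaCy pipeline (assuming da_core_news_sm is installed; the tok2vec/listener wiring would need checking):

import spacy
import dacy

nlp = dacy.load("da_dacy_medium_trf-0.1.0")
# any Danish pipeline with a trained morphologizer works as a source
source_nlp = spacy.load("da_core_news_sm")
# copy the trained morphologizer into the DaCy pipeline
nlp.add_pipe("morphologizer", source=source_nlp)

doc = nlp("Hej verden.")
print([(t.text, str(t.morph)) for t in doc])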
Examine a potential discrepancy between the spaCy dependency parse and the DaCy dependency parse, as noted by @rdkm89.
Add word frequencies estimated from Danish Gigaword as a language resource.
and URLs
spacy-transformers changed its API going from 1.0.x to 1.1.x, and the classification wrapper will need to be updated accordingly.
Add the DaNLP and Twitter coref. models.
LIT is a great tool for exploring model predictions. I would love to add a tutorial on using LIT with DaCy.
This is going to be great -> 🎉
(might be doable simply with MLM)
The new spaCy project for augmenters, augmenty, is a much more suitable place for the augmenters than DaCy. The augmenters will thus be moved there instead.
>>> nlp = dacy.load("da_dacy_large_trf-0.1.0")
Defaulting to user installation because normal site-packages is not writeable
Collecting da-dacy-large-trf==any
ERROR: HTTP error 404 while getting https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl
ERROR: Could not install requirement da-dacy-large-trf==any from https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl because of HTTP error 404 Client Error: Not Found for url: https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl for URL https://huggingface.co/chcaa/da_dacy_medium_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl
I note that the subdirectory part "da_dacy_medium_trf" does not correspond to da_dacy_large_trf-any-py3-none-any.whl
Changing the subdirectory in the URL:
python3 -m pip install https://huggingface.co/chcaa/da_dacy_large_trf/resolve/main/da_dacy_large_trf-any-py3-none-any.whl
seems to work:
python3
>>> import dacy
>>> nlp = dacy.load("da_dacy_large_trf-0.1.0")
>>> doc = nlp('Hej verden.')
/usr/local/lib/python3.8/dist-packages/spacy/pipeline/attributeruler.py:108: UserWarning: [W036] The component 'matcher' does not have any patterns defined.
matches = self.matcher(doc, allow_missing=True)
>>> doc._.trf_data.tensors[-1].shape
(1, 1024)
After removing the readability measures, it would be nice to have a tutorial on: "Extracting text statistics and readability metrics using DaCy and Textdescriptives"
Potentially using the packages to examine the difference in language complexity between conversational data and legal documents in DAGW, or a similar task using a publicly available dataset. A starting point is sketched below.
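A minimal sketch of what the tutorial could start from, assuming the textdescriptives 2.x component naming:

import dacy
import textdescriptives as td

nlp = dacy.load("da_dacy_medium_trf-0.1.0")
# add the readability component from textdescriptives (2.x naming)
nlp.add_pipe("textdescriptives/readability")

doc = nlp("Dette er en kort dansk tekst, som vi gerne vil beskrive.")
# pull the computed metrics out as a pandas DataFrame
print(td.extract_df(doc))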
It should be possible using spacy-wrap 1.2.0.
I would like to configure the cache dir with an environment variable instead of using the default ~/.dacy. This is a relatively simple change, but it is useful if you want the cache to be in a known place (e.g. a specific folder in a Docker image). I created a PR for this feature (#67). Some projects like to have associated issues, so I am creating an issue as well.
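A minimal sketch of the proposed lookup; DACY_CACHE_DIR is a hypothetical variable name (the actual name is whatever PR #67 defines):

import os
from pathlib import Path

# fall back to the current default when the variable is not set
DEFAULT_CACHE_DIR = Path.home() / ".dacy"
cache_dir = Path(os.environ.get("DACY_CACHE_DIR", str(DEFAULT_CACHE_DIR)))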
Add a community discussion board.
When running:
import dacy

for model in dacy.models():
    print(model)

dacy_nlp = dacy.load('medium')
doc = dacy_nlp("DaCy er en hurtig og effektiv pipeline til dansk sprogprocessering bygget i SpaCy.")
print('hej')
I get the following output:
da_dacy_small_tft-0.0.0
da_dacy_medium_tft-0.0.0
da_dacy_large_tft-0.0.0
da_dacy_small_trf-0.1.0
da_dacy_medium_trf-0.1.0
da_dacy_large_trf-0.1.0
/venv/lib/python3.9/site-packages/spacy/util.py:833: UserWarning: [W095] Model 'da_dacy_medium_trf' (0.1.0) was trained with spaCy v3.1 and may not be 100% compatible with the current version (3.2.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
/venv/lib/python3.9/site-packages/spacy/util.py:833: UserWarning: [W095] Model 'da_dacy_small_trf' (0.1.0) was trained with spaCy v3.1 and may not be 100% compatible with the current version (3.2.4). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
/venv/lib/python3.9/site-packages/spacy_transformers/pipeline_component.py:406: UserWarning: Automatically converting a transformer component from spacy-transformers v1.0 to v1.1+. If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spacy-transformers version. For more details and available updates, run: python -m spacy validate
warnings.warn(warn_msg)
/venv/lib/python3.9/site-packages/torch/amp/autocast_mode.py:198: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
/venv/lib/python3.9/site-packages/spacy/pipeline/attributeruler.py:150: UserWarning: [W036] The component 'matcher' does not have any patterns defined.
matches = self.matcher(doc, allow_missing=True, as_spans=False)
hej
Notably, this includes three warnings, covering the spaCy version, the CUDA device, and the matcher object (see also #72).
The original version was sent to me by mail.
Note: while these warnings appear, DaCy still works as intended; the spaCy version does not influence model performance.
Moved from #133, originally posted by @EaLindhardt
I've tried to install DaCy through Anaconda, both with pip and conda install, using the different installation options described at: https://centre-for-humanities-computing.github.io/DaCy/installation.html
When running import dacy I get the following:
---------------------------------------------------------------------------
ContextualVersionConflict                Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 import dacy

File ~\AppData\Roaming\Python\Python39\site-packages\dacy\__init__.py:4, in <module>
      1 from dacy.hate_speech import make_offensive_transformer  # noqa
      2 from dacy.sentiment import make_emotion_transformer  # noqa
----> 4 from .about import download_url, title, version  # noqa
      5 from .download import download_model  # noqa
      6 from .load import load, models, where_is_my_dacy

File ~\AppData\Roaming\Python\Python39\site-packages\dacy\about.py:3, in <module>
      1 import pkg_resources
----> 3 version = pkg_resources.get_distribution("dacy").version
      4 title = "dacy"
      5 download_url = "https://github.com/centre-for-humanities-computing/DaCy"

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:477, in get_distribution(dist)
    475 dist = Requirement.parse(dist)
    476 if isinstance(dist, Requirement):
--> 477     dist = get_provider(dist)
    478 if not isinstance(dist, Distribution):
    479     raise TypeError("Expected string, Requirement, or Distribution", dist)

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:353, in get_provider(moduleOrReq)
    351 """Return an IResourceProvider for the named module or requirement"""
    352 if isinstance(moduleOrReq, Requirement):
--> 353     return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
    354 try:
    355     module = sys.modules[moduleOrReq]

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:897, in WorkingSet.require(self, *requirements)
    888 def require(self, *requirements):
    889     """Ensure that distributions matching requirements are activated
    890
    891     requirements must be a string or a (possibly-nested) sequence
    (...)
    895     included, even if they were already activated in this working set.
    896     """
--> 897     needed = self.resolve(parse_requirements(requirements))
    899     for dist in needed:
    900         self.add(dist)

File ~\Anaconda3\lib\site-packages\pkg_resources\__init__.py:788, in WorkingSet.resolve(self, requirements, env, installer, replace_conflicting, extras)
    785 if dist not in req:
    786     # Oops, the "best" so far conflicts with a dependency
    787     dependent_req = required_by[req]
--> 788     raise VersionConflict(dist, req).with_context(dependent_req)
    790 # push the new requirements onto the stack
    791 new_requirements = dist.requires(req.extras)[::-1]

ContextualVersionConflict: (spacy 3.3.1 (c:\users\au576018\anaconda3\lib\site-packages), Requirement.parse('spacy<3.3.0,>=3.2.0'), {'dacy'})
How do I solve this?
@EaLindhardt, will you please add the following information:
You can type python -m spacy info --markdown and copy-paste the result here, along with the DaCy version, which you can get using python -c "import dacy; print(dacy.__version__)".
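For reference, the ContextualVersionConflict above says that spacy 3.3.1 is installed while DaCy requires spacy<3.3.0,>=3.2.0, so one likely workaround (untested here) is to install spaCy within the required range:

python3 -m pip install "spacy>=3.2.0,<3.3.0"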
Currently, the training scripts do not train with Ray, which would make training much faster. However, enabling Ray seemingly makes the model not train at all.
Check whether the tokenizer maps å -> a. If that is the case, set "strip_accents": False when training models (it shouldn't be a big issue, as the NN should model this discrepancy). Also check how the tokenization deals with emojis. A quick check is sketched below.
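A minimal sketch of the check with the Hugging Face tokenizer; the checkpoint name is an assumption, use whichever model DaCy is trained on:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Maltehb/danish-bert-botxo")  # assumed checkpoint
print(tok.tokenize("blåbærgrød"))    # if "å" comes back as "a", accents are being stripped
print(tok.tokenize("Godt nytår 🎉"))  # check how emojis are tokenized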
@Guscode and @jacdals97 mentioned that they could extract the correct lemmas from the initial tagged data from SentiDa instead of using the stems. This will likely improve the current performance of DaVADER.
Move the repository to the Centre for Humanities Computing organization, to officially show that it is under CHC support.
Potentially using something like:
https://www.statestitle.com/resource/using-nlp-bert-to-improve-ocr-accuracy/
Another interesting read might be this blog post by Grammarly:
https://www.grammarly.com/blog/engineering/gec-tag-not-rewrite/
It might also be relevant to check out Grammarly's GECToR:
https://github.com/grammarly/gector
Potentially also check out:
https://github.com/PrithivirajDamodaran/Gramformer
Score is missing from the documentation, despite being well documented in the code.
The versions specified for some requirements are too strict. Specifically, I am trying to install DaCy together with dvc and got an error that the versions of tqdm are incompatible. For tqdm specifically, I don't think you need to specify a version at all, because it's pretty hard to break and it's a requirement far upstream. But as an example, I would pin from the smallest minor version that you need up to the next major version, which is how spacy does it. I would do this with pandas as well.
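For illustration, the pinning style described above could look like this in the requirements (version numbers hypothetical):

tqdm
pandas>=1.0.0,<2.0.0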
I'm doing a project where I compare different Danish embedding models using the sentence-transformers library. However, when I tried to load the model (from HuggingFace), I got the following error:
WARNING:root:No sentence-transformers model found with name C:\Users\jhr/.cache\torch\sentence_transformers\chcaa_da_dacy_small_trf. Creating a new one with MEAN pooling.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 89, in __init__
    modules = self._load_auto_model(model_path)
  File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\sentence_transformers\SentenceTransformer.py", line 790, in _load_auto_model
    transformer_model = Transformer(model_name_or_path)
  File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\sentence_transformers\models\Transformer.py", line 28, in __init__
    config = AutoConfig.from_pretrained(model_name_or_path, **model_args, cache_dir=cache_dir)
  File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\transformers\models\auto\configuration_auto.py", line 527, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\transformers\configuration_utils.py", line 546, in get_config_dict
    resolved_config_file = cached_path(
  File "C:\Users\jhr\Anaconda3\envs\bertopic_explore\lib\site-packages\transformers\file_utils.py", line 1420, in cached_path
    raise ValueError(f"unable to parse {url_or_filename} as a URL or as a local path")
ValueError: unable to parse C:\Users\jhr/.cache\torch\sentence_transformers\chcaa_da_dacy_small_trf\config.json as a URL or as a local path
The error persists both locally and in Google Colab. I can successfully load other models, such as Ælæctra, which is why it might be a DaCy-specific issue. If there's any way to fix this, please let me know :))
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("chcaa/da_dacy_small_trf")
When the new version of Asent is ready, replacing the DaVADER implementation in DaCy with it seems like a more stable solution.
Based on the evaluation by daLUKE, it might be relevant to add WikiANN and Plank as OOB datasets for NER.
DaCy contains a few readability measures. The package textdescriptives contains a much better implementation of readability measures. Thus, we recommend using textdescriptives instead.
@martincjespersen brought to my attention that there is currently no check of biases in Danish NLP models. It is probably possible to use the augmentation which is already in development and check performance using the spaCy scorer, as sketched after the issue link below.
augmentation issue:
#4
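A minimal sketch of the comparison, assuming a list of gold Example objects (e.g. from DaNE) and a hypothetical augment function from the augmentation work in #4:

import dacy
from spacy.training import Example  # examples below is a list[Example]

nlp = dacy.load("da_dacy_medium_trf-0.1.0")

def bias_check(nlp, examples, augment):
    # nlp.evaluate runs spaCy's built-in scorer and returns a dict of metrics
    scores_original = nlp.evaluate(examples)
    scores_augmented = nlp.evaluate([augment(eg) for eg in examples])
    # compare e.g. NER F-score before and after name augmentation
    return scores_original["ents_f"], scores_augmented["ents_f"]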
@martincjespersen notes that the SpaCy model performs slightly worse (but is still SOTA) when it is forced to use the original tokens of the DaNE corpus (for person NER). I assume this is also how performance is tested using SpaCy's evaluate script, so there seems to be a discrepancy between the two evaluate functions.
Testing the DaNLP evaluate script against the SpaCy scorer should resolve the issue.
Currently, the project YAML is fairly inefficient, with more copy-and-paste than should be necessary. It might be ideal to cut down on this by adding arguments to workflows and calling the same functions, assuming that is possible.
It might be relevant to improve the Danish tokenisation, e.g. the use of apostrophes in the following examples:
especially thanks to @martincjespersen for bringing this to my attention.
Currently, the person augmenter only deals with names; it could just as well deal with LOC, ORG, etc.
Similarly, it currently does two things: replacing a person with a new name and augmenting the name format. Split these up into two functions.
Currently, the model uses a lookup-based lemmatization on the training set. This can be improved by adapting the lemmy package for v3 of SpaCy.
Another potential solution might be to use the lemmatization LSTM from Stanza, which should be accessible using the spaCy integration. However, it might not perform as well out of distribution.
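A minimal sketch of getting Stanza lemmas through the spacy-stanza wrapper (assuming spacy-stanza 1.x):

import stanza
import spacy_stanza

stanza.download("da")  # one-time download of the Danish Stanza model
# wrap Stanza in a spaCy pipeline; lemmas come from Stanza's neural lemmatizer
nlp = spacy_stanza.load_pipeline("da")
doc = nlp("Hunden løb gennem parken.")
print([(t.text, t.lemma_) for t in doc])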
Update the testing workflow to work across multiple versions of Python, as done in:
https://github.com/KennethEnevoldsen/asent/blob/main/.github/workflows/pytest-cov-comment.yml