Comments (12)
If it is of any use, there are now close to 50,000 representation-lemma relationships in Wikidata: https://w.wiki/457J
from dacy.
There are now 115,204 representation-lemma relationships in Wikidata: https://w.wiki/457J
from dacy.
An alternative approach is to use the new neural edit trees.
from dacy.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
from dacy.
This issue was closed because it has been stalled for 5 days with no activity.
from dacy.
Wonderful, thanks @fnielsen. The latest version (which will probably be released today or tomorrow) improves the lemmatisation by quite a margin (all the way up to ~95%) using neural edit trees.
However, I know that the Greek model odycy uses a hybrid approach. We might want to try that out as the next thing.
Edit: The new 0.2.0 models are now up!
from dacy.
Actually @fnielsen, if I were to integrate with the Danish word registries, would you recommend doing so using Wikidata?
Some considerations I have:
- Would there be things I couldn't do using Wikidata? I.e., is there relevant metadata that I might want to include?
- How well would it transfer to e.g. Norwegian and Swedish (to allow for generalizations of DaCy)?
It is fine if you don't know the answer; you simply seem to have more expertise on this than I do. I would generally prefer using Wikidata, but I am unsure what the tradeoffs are. Hoping you can help me.
from dacy.
Bokmål and Swedish are currently larger than Danish with respect to Wikidata lexemes, see https://ordia.toolforge.org/language/ : 40,862 Swedish lemmas, 32,431 Bokmål lemmas, 21,583 Danish lemmas, and 15,036 Nynorsk lemmas. Perhaps that is not sufficient for good coverage. With respect to forms there are, e.g., 282,378 Swedish forms in Wikidata.
I plan to copy most forms from Det Centrale Ordregister (COR) https://ordregister.dk/ to Wikidata, so the Danish lexemes on Wikidata should number around 100,000.
I should think that any information in COR would also be in Wikidata, and there would be further metadata.
One issue, though, is what a lemma is. Take, e.g., "understimuleret": is that an adjective (lemma understimuleret) or a verb form (lemma understimulere)? See https://openreview.net/pdf?id=kvEmQxxAab . I would tend to see "understimuleret" as an adjective.
from dacy.
Thanks @fnielsen. Coverage isn't too much of a problem: we can use a fallback strategy where we first do the lookup and, if that fails (e.g. for new words), fall back to the neural edit tree.
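A rough sketch of what such a lookup-then-fallback strategy could look like. Everything here is an assumption for illustration: the table keyed on (form, POS), the `make_lemmatizer` helper, and the `identity_fallback` stand-in are hypothetical and not part of DaCy or spaCy. In practice the fallback would be the neural edit tree lemmatizer rather than the trivial function used here.

```python
from typing import Callable, Dict, Tuple

# Hypothetical lookup table: (lowercased form, POS tag) -> lemma.
# In practice this could be filled from Wikidata lexeme forms or COR.
LemmaTable = Dict[Tuple[str, str], str]
Fallback = Callable[[str, str], str]


def make_lemmatizer(table: LemmaTable, fallback: Fallback) -> Callable[[str, str], str]:
    """Return a lemmatizer that tries the table first, then the fallback."""

    def lemmatize(form: str, pos: str) -> str:
        lemma = table.get((form.lower(), pos))
        if lemma is not None:
            return lemma
        # Unknown (form, POS) pair, e.g. a new word: defer to the fallback model.
        return fallback(form, pos)

    return lemmatize


def identity_fallback(form: str, pos: str) -> str:
    """Stand-in for a trained model; it just lowercases the form."""
    return form.lower()


# Keying on POS lets an ambiguous form carry both readings.
table: LemmaTable = {
    ("understimuleret", "ADJ"): "understimuleret",
    ("understimuleret", "VERB"): "understimulere",
}
lemmatize = make_lemmatizer(table, identity_fallback)
```

Keying the table on POS is one possible way to handle forms where the lemma depends on the analysis; which reading is chosen would still depend on the tagger.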
from dacy.
This issue is stale because it has been open for 14 days with no activity. Feel free to either 1) remove the stale label or 2) comment. If nothing happens, this will be closed in 7 days.
from dacy.
This issue was closed automatically. Feel free to re-open it if it's important.
from dacy.