Currently, the model used a lookup-based lemmatization on the training set. This can b

Improve lemmatization (centre-for-humanities-computing/dacy)

Comments (12)

fnielsen commented on June 26, 2024

If it is of any use, there are now close to 50,000 representation-lemma relationships in Wikidata: https://w.wiki/457J

from dacy.

fnielsen commented on June 26, 2024

There are now 115,204 representation-lemma relationships in Wikidata: https://w.wiki/457J

KennethEnevoldsen commented on June 26, 2024

An alternative approach is to use the new neural edit trees.

github-actions commented on June 26, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions commented on June 26, 2024

This issue was closed because it has been stalled for 5 days with no activity.

KennethEnevoldsen commented on June 26, 2024

Wonderful, thanks @fnielsen. The latest version (which will probably be released today or tomorrow) improves the lemmatisation by quite a margin (all the way up to ~95%) using neural edit trees.

However, I know that the Greek model OdyCy uses a hybrid approach. I might want to try that out next.

Edit: The new 0.2.0 models are now up!

KennethEnevoldsen commented on June 26, 2024

Actually @fnielsen, if I were to integrate with the Danish word registries, would you recommend doing so using Wikidata?

Some considerations I have:

Would there be things I couldn't do using Wikidata? I.e., is there relevant metadata that I might want to include?
How well would it transfer to e.g. Norwegian and Swedish (to allow for generalizations of DaCy)?

It is fine if you don't know the answer; you simply seem to have more expertise on this than I do. I would generally prefer using Wikidata, but I am unsure what the tradeoffs are. Hoping you can help me.

fnielsen commented on June 26, 2024

Bokmål and Swedish currently have more Wikidata lexemes than Danish, see https://ordia.toolforge.org/language/ : 40,862 Swedish lemmas, 32,431 Bokmål lemmas, 21,583 Danish lemmas, and 15,036 Nynorsk lemmas. Perhaps that is not sufficient for good coverage. With respect to forms there are, e.g., 282,378 Swedish forms in Wikidata.

I plan to copy most forms from Det Centrale Ordregister (COR) https://ordregister.dk/ to Wikidata, so the Danish lexemes on Wikidata should number around 100,000.

I should think that any information there is in COR would also be in Wikidata. And there would be further metadata.

One issue, though, is what a lemma is. Take "understimuleret", for example: is that an adjective (lemma understimuleret) or a verb form (lemma understimulere)? See https://openreview.net/pdf?id=kvEmQxxAab for a discussion. I would tend to see "understimuleret" as an adjective.
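
To make the ambiguity concrete, here is a minimal sketch of a lookup keyed on both the form and its part of speech. The table contents, the POS labels, and the function name `lookup_lemma` are illustrative assumptions, not data or an API from COR, Wikidata, or DaCy:

```python
from typing import Dict, Optional, Tuple

# Hypothetical lookup table keyed by (form, part of speech).
# The entries are illustrative only, not taken from COR or Wikidata.
LEMMA_TABLE: Dict[Tuple[str, str], str] = {
    ("understimuleret", "ADJ"): "understimuleret",
    ("understimuleret", "VERB"): "understimulere",
}


def lookup_lemma(form: str, pos: str) -> Optional[str]:
    """Return the lemma for a (form, POS) pair, or None if it is not listed."""
    return LEMMA_TABLE.get((form.lower(), pos))
```

With a table like this, the same surface form resolves to a different lemma depending on which analysis the tagger picks, so the lemma question becomes a question about the POS annotation.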

KennethEnevoldsen commented on June 26, 2024

Thanks @fnielsen. Coverage isn't too much of a problem: we can use a fallback strategy where we first do the lookup and, if that fails (e.g. for new words), fall back to the neural edit tree.
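
A minimal sketch of such a fallback strategy, in plain Python. Everything named here is an assumption for illustration: `lemmatize_with_fallback`, the toy lookup table, and `toy_fallback`, which merely stands in for a trained neural edit tree lemmatizer and is not how DaCy implements it:

```python
from typing import Callable, Mapping


def lemmatize_with_fallback(
    form: str,
    lookup: Mapping[str, str],
    fallback: Callable[[str], str],
) -> str:
    """Try the lookup table first; use the fallback for unseen forms."""
    lemma = lookup.get(form.lower())
    if lemma is not None:
        return lemma
    # Form not in the table (e.g. a new word): defer to the fallback model.
    return fallback(form)


# Toy lookup table (illustrative entries only).
TOY_LOOKUP = {"husene": "hus", "bilerne": "bil"}


def toy_fallback(form: str) -> str:
    """Stand-in for a statistical lemmatizer: returns the lowercased form."""
    return form.lower()


known = lemmatize_with_fallback("Husene", TOY_LOOKUP, toy_fallback)
unseen = lemmatize_with_fallback("Podcast", TOY_LOOKUP, toy_fallback)
```

Here `known` comes from the table and `unseen` from the fallback. In a real pipeline the fallback would be the model's prediction for that token.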

github-actions commented on June 26, 2024

This issue is stale because it has been open for 14 days with no activity. Feel free to either 1) remove the stale label or 2) comment. If nothing happens, this will be closed in 7 days.

github-actions commented on June 26, 2024

This issue was closed automatically. Feel free to re-open it if it's important.
