fnielsen commented on June 26, 2024

If it is of any use, there are now close to 50,000 representation-lemma relationships in Wikidata: https://w.wiki/457J

from dacy.

fnielsen commented on June 26, 2024

There are now 115,204 representation-lemma relationships in Wikidata: https://w.wiki/457J

KennethEnevoldsen commented on June 26, 2024

An alternative approach is to use the new neural edit trees.

github-actions commented on June 26, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions commented on June 26, 2024

This issue was closed because it has been stalled for 5 days with no activity.

KennethEnevoldsen commented on June 26, 2024

Wonderful, thanks @fnielsen. The latest version (which will probably be released today or tomorrow) improves lemmatisation by quite a margin (all the way up to ~95%) using neural edit trees.

However, I know that the Greek model, odycy, uses a hybrid approach. We might want to try that out as the next step.

Edit: The new 0.2.0 models are now up!

KennethEnevoldsen commented on June 26, 2024

Actually, @fnielsen, if I were to integrate with the Danish word registries, would you recommend doing so using Wikidata?

Some considerations I have:

  1. Would there be things I couldn't do using Wikidata? I.e., is there relevant metadata that I might want to include?
  2. How well would it transfer to e.g. Norwegian and Swedish (to allow for generalizations of DaCy)?

It is fine if you don't know the answer; you simply seem to have more expertise on this than I do. I would generally prefer using Wikidata, but I am unsure what the tradeoffs are. Hoping you can help me.

fnielsen commented on June 26, 2024

Bokmål and Swedish are currently larger than Danish with respect to Wikidata lexemes, see https://ordia.toolforge.org/language/ : 40,862 Swedish lemmas, 32,431 Bokmål lemmas, 21,583 Danish lemmas, and 15,036 Nynorsk lemmas. Perhaps that is not sufficient for good coverage. As for forms, there are, e.g., 282,378 Swedish forms in Wikidata.

I plan to copy most forms from Det Centrale Ordregister (COR) https://ordregister.dk/ to Wikidata, so the Danish lexemes on Wikidata should number around 100,000.

I should think that any information in COR would also be in Wikidata, and there would be further metadata.

One issue, though, is what a lemma is, e.g., for "understimuleret": is that an adjective (lemma understimuleret) or a verb form (lemma understimulere)? See https://openreview.net/pdf?id=kvEmQxxAab. I would tend to see "understimuleret" as an adjective.

KennethEnevoldsen commented on June 26, 2024

Thanks @fnielsen. Coverage isn't too much of a problem: we can use a fallback strategy where we first do the lookup and, if that fails (e.g. for new words), fall back to the neural edit tree.
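
To make the lookup-then-fallback idea concrete, here is a minimal sketch in plain Python. It is not DaCy's or spaCy's API: `make_hybrid_lemmatizer`, `TOY_LOOKUP` and `toy_fallback` are hypothetical names made up for illustration. In practice the table would presumably be built from Wikidata or COR forms, and the fallback would be the neural edit tree lemmatizer; neither is wired in here. Keying the table on (form, part of speech) is also an assumption, chosen because it lets both readings of "understimuleret" coexist.

```python
from typing import Callable, Dict, Tuple

# (lower-cased form, part-of-speech tag) -> lemma
LookupTable = Dict[Tuple[str, str], str]
Lemmatizer = Callable[[str, str], str]


def make_hybrid_lemmatizer(lookup: LookupTable, fallback: Lemmatizer) -> Lemmatizer:
    """Return a lemmatizer that tries the table first, then the fallback."""

    def lemmatize(form: str, pos: str) -> str:
        lemma = lookup.get((form.lower(), pos))
        if lemma is not None:
            return lemma  # known form: trust the word registry
        return fallback(form, pos)  # unknown form: defer to the model

    return lemmatize


# Hypothetical toy table; real entries would come from Wikidata or COR.
TOY_LOOKUP: LookupTable = {
    ("understimuleret", "ADJ"): "understimuleret",
    ("understimuleret", "VERB"): "understimulere",
    ("husene", "NOUN"): "hus",
}


def toy_fallback(form: str, pos: str) -> str:
    """Stand-in for a trained lemmatizer: just lower-cases the form."""
    return form.lower()


lemmatize = make_hybrid_lemmatizer(TOY_LOOKUP, toy_fallback)
```

With this setup, `lemmatize("Husene", "NOUN")` is answered from the table, while a form missing from the table is passed unchanged to the fallback, so coverage gaps in the registry only cost accuracy on those forms.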

github-actions commented on June 26, 2024

This issue is stale because it has been open for 14 days with no activity. Feel free to either 1) remove the stale label or 2) comment. If nothing happens, this will be closed in 7 days.

github-actions commented on June 26, 2024

This issue was closed automatically. Feel free to re-open it if it's important.
