
Comments (13)

simongray avatar simongray commented on June 16, 2024

More relevant comments available in issue #15 (now closed).

from dannet.

simongray avatar simongray commented on June 16, 2024

See also #16 (connotations) which pertains to sentiment analysis work done by Sussi. These relations have also been left out for now.


simongray avatar simongray commented on June 16, 2024

Bart has created an SQL dump for the data that Sussi has produced.

I might be able to create an in-memory SQLite db, import the data into that, and then extract the needed table(s) using JDBC. Some more information: https://grishaev.me/en/clj-sqlite/
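As a rough illustration of the in-memory idea, here is a minimal sketch in Python using the standard sqlite3 module instead of JDBC. The dump text and the table name tbl_example are made up for illustration, and the sketch assumes the dump is written in SQL that SQLite accepts, which a MySQL dump often is not without some conversion.

```python
import sqlite3

# Hypothetical stand-in for the text of an SQL dump file (not the real data).
DUMP = """
CREATE TABLE tbl_example (id INTEGER PRIMARY KEY, lemma TEXT);
INSERT INTO tbl_example (id, lemma) VALUES (1, 'hund'), (2, 'kat');
"""


def load_dump(sql_text: str) -> sqlite3.Connection:
    """Create an in-memory SQLite database and run every statement in the dump."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(sql_text)
    return conn


def table_names(conn: sqlite3.Connection) -> list:
    """List the tables that the dump created."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


conn = load_dump(DUMP)
tables = table_names(conn)
rows = conn.execute("SELECT id, lemma FROM tbl_example ORDER BY id").fetchall()
print(tables)
print(rows)
```

With a real dump, the text would be read from the file and the table list would show which tables are worth extracting.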


simongray avatar simongray commented on June 16, 2024

Currently attempting detective work on the SQL dump using a dockerized MySQL db:

# create and populate database
docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:latest
docker exec -i some-mysql sh -c 'exec mysql -uroot -p"my-secret-pw"' < /Users/rqf595/Downloads/wordnetloom-wordnet.sql

# connect to container and database inside it
docker exec -it some-mysql /bin/bash
mysql -p wordnet


simongray avatar simongray commented on June 16, 2024

After running some SQL queries in the mysql shell, it appears that the tables

  • tbl_synset
  • tbl_synset_attributes
  • tbl_synset_relation
  • ... and possibly tbl_relation_type

are the only relevant tables.

The actual synsets are linked using makeshift binary IDs generated by the software that Sussi used to create them. The table tbl_synset_attributes includes two columns that bear witness to other ID types:

[Screenshot, 2022-12-15 at 11:03:00]

However, the ILI (Interlingual Index ID) isn't relevant unless we have links to DanNet from this ID.


simongray avatar simongray commented on June 16, 2024

Importing the Open English WordNet presents an interesting challenge as the dataset resource has a relation to every entry it encompasses: http://localhost:3456/dannet/external?subject=%3Chttps%3A%2F%2Fen-word.net%2F%3E

Another challenge is the fact that the dataset is quite minimal and doesn't have labels for any resources. The only label-like relation is for the canonicalForm.


simongray avatar simongray commented on June 16, 2024

Apparently, the original links to the Princeton wordnet are not included in the WordNetLoom data, so they will need to be imported via the old DanNet data and converted to Open English WordNet IDs.

https://github.com/globalwordnet/cili


simongray avatar simongray commented on June 16, 2024

Having looked more thoroughly into the two different types of IDs in the old link data, e.g.

  • production%1:23:00::
  • bundle%1:06:00::
  • equipment%1:06:00::

vs. the more familiar ENG20-07523126-n, which seems to be the ID type used in the cili repo, I have had some difficulty understanding how to get from these other IDs to ones which are mapped. The unfamiliar IDs are seemingly based on the complex, makeshift database of the WordNet project, and refer to lemmas present in multiple different files (mapped to integers) in the different WordNet releases. I haven't been able to find a translation table anywhere.
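For reference, IDs of the first kind have the shape of Princeton WordNet sense keys, which are laid out as lemma%ss_type:lex_filenum:lex_id:head_word:head_id. The sketch below only splits such a string into its fields. The names SenseKey and parse_sense_key are made up for illustration, and splitting a key does not by itself yield a synset ID; that still requires a lookup table.

```python
from typing import NamedTuple

# Synset type codes as used in WordNet sense keys.
SS_TYPES = {1: "noun", 2: "verb", 3: "adjective", 4: "adverb", 5: "adjective satellite"}


class SenseKey(NamedTuple):
    lemma: str
    ss_type: int
    lex_filenum: int
    lex_id: int
    head_word: str  # only filled in for adjective satellites
    head_id: str    # only filled in for adjective satellites


def parse_sense_key(key: str) -> SenseKey:
    """Split 'lemma%ss_type:lex_filenum:lex_id:head_word:head_id' into fields."""
    lemma, sep, lex_sense = key.rpartition("%")
    if not sep:
        raise ValueError(f"no '%' separator in {key!r}")
    parts = lex_sense.split(":")
    if len(parts) != 5:
        raise ValueError(f"expected 5 colon-separated fields in {key!r}")
    ss_type, lex_filenum, lex_id, head_word, head_id = parts
    return SenseKey(lemma, int(ss_type), int(lex_filenum), int(lex_id), head_word, head_id)


parsed = parse_sense_key("production%1:23:00::")
print(parsed, SS_TYPES[parsed.ss_type])
```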


simongray avatar simongray commented on June 16, 2024

John McCrae was very helpful and wrote me the following guide:

Hi Simon,

These are sense keys, that are used to indicate the word in its synset (i.e., there is one sense key for each member of a synset). They are supposed to be more stable than the synset identifiers (but they aren't) and are preferred by the Princeton team. The full description of them is here:

https://wordnet.princeton.edu/documentation/senseidx5wn

They are quite tricky to calculate, OEWN has a whole script for doing it here:

https://github.com/globalwordnet/english-wordnet/blob/main/scripts/sense_keys.py

You can find all the sense keys and the relevant synsets in the src data for OEWN in the entries-*.yaml files, such as:

https://github.com/globalwordnet/english-wordnet/blob/main/src/yaml/entries-x.yaml

For Princeton WordNet releases, they are normally in a file called sense.index.

Regards,

John

I think the last link (the directory of YAML files) is what I need. I'll parse them all and build a mapping from these sense IDs to the OEWN synsets.
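A minimal sketch of that mapping step in Python follows. It works on already-parsed data (for instance, what a YAML loader would return) so that it needs no third-party library. The nested layout lemma -> part of speech -> "sense" list with "id" and "synset" fields is an assumption about the entries-*.yaml files, the function name build_sense_map is made up, and the synset IDs in the sample are placeholders rather than real OEWN IDs.

```python
def build_sense_map(entries: dict) -> dict:
    """Map each sense key to its synset ID.

    Assumed input layout (hypothetical, modelled on the entries-*.yaml files):
        {lemma: {pos: {"sense": [{"id": <sense key>, "synset": <synset id>}]}}}
    """
    sense_map = {}
    for by_pos in entries.values():
        for entry in by_pos.values():
            for sense in entry.get("sense", []):
                sense_map[sense["id"]] = sense["synset"]
    return sense_map


# Placeholder data: the sense keys come from the thread, the synset IDs are invented.
SAMPLE_ENTRIES = {
    "bundle": {"n": {"sense": [{"id": "bundle%1:06:00::", "synset": "00000001-n"}]}},
    "equipment": {"n": {"sense": [{"id": "equipment%1:06:00::", "synset": "00000002-n"}]}},
}

sense_map = build_sense_map(SAMPLE_ENTRIES)
print(sense_map["bundle%1:06:00::"])
```

Running the function over every parsed file and merging the results would give one lookup table from the old sense keys to synsets.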


simongray avatar simongray commented on June 16, 2024

I have mapped the eq_synonym relations with senseidx in the 5000 old links, but not the remaining 123 as the GWA schema had no equivalent relations. I still need to map the wn20 IDs. Eventually, I should also link directly to the ILI instead by using the existing links in the OEWN.

As an aside, I think a companion dataset containing labels for the OEWN would be very valuable since the OEWN dataset currently doesn't contain any labels. This dataset can be generated based on the lexical forms present in the dataset (= lemmas).
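One way such labels could be derived is sketched below in Python: collect the lemmas that point at each synset and join them into a single string. The input layout is the same assumed lemma -> part of speech -> "sense" structure as in the entries files, the "{a; b}" label format is only one possible choice, the function name synset_labels is made up, and the synset ID in the sample is a placeholder.

```python
from collections import defaultdict


def synset_labels(entries: dict) -> dict:
    """Build a label per synset from the lemmas whose senses refer to it."""
    lemmas = defaultdict(set)
    for lemma, by_pos in entries.items():
        for entry in by_pos.values():
            for sense in entry.get("sense", []):
                lemmas[sense["synset"]].add(lemma)
    # Sort lemmas so the generated label is stable between runs.
    return {
        synset: "{" + "; ".join(sorted(members)) + "}"
        for synset, members in lemmas.items()
    }


# Placeholder data: the synset ID is invented for illustration.
SAMPLE = {
    "car": {"n": {"sense": [{"id": "car%1:06:00::", "synset": "00000001-n"}]}},
    "automobile": {"n": {"sense": [{"id": "automobile%1:06:00::", "synset": "00000001-n"}]}},
}

labels = synset_labels(SAMPLE)
print(labels)
```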


simongray avatar simongray commented on June 16, 2024

In the process of linking DanNet to the Open English WordNet, I discovered a couple of errors in the OEWN dataset, one critical (ILI linking) and one less so:


simongray avatar simongray commented on June 16, 2024

Since the CILI resources do not have outgoing relations for the incoming relations, e.g. wn:ili, it really makes a lot of sense to implement #53 to make navigating via the CILI feasible.


simongray avatar simongray commented on June 16, 2024

Getting ready to release most of the old English links as a sort of preview: 039ecc0

