Comments (8)
@ekaf I think the CILI as a resource is better thought of as the inventory of identifiers and their definitions than as a collection of mappings. The mappings to synsets should be maintained by the respective wordnet projects, although in practice we keep some mapping files in the CILI repository. Those mapping files are used when creating the WN-LMF exports of the PWN.
Let me try to describe ILI support in Wn. WN-LMF lexicons can link synsets to individual ILIs like this (example from OEWN 2021):
<Synset id="oewn-15307914-n" ili="i117563" members="oewn-speed-n oewn-velocity-n" partOfSpeech="n" dc:subject="noun.time">
                             ~~~~~~~~~~~~~
These ILIs are stored in Wn's database linked to the synsets. When a second lexicon is loaded containing synsets with the same ILIs, such as this (from OMW 1.4's Spanish wordnet):
<Synset id="omw-es-15282696-n" ili="i117563" partOfSpeech="n" members="omw-es-velocidad-15282696-n" />
                               ~~~~~~~~~~~~~
... then Wn is able to use the shared ILI to link the synsets across lexicons for translation or expanded relation traversal. Another thing we see is synsets with the special ILI "in", which indicates that that version of the lexicon is proposing the synset as a candidate for a new ILI. For example:
<Synset id="oewn-90002921-n" ili="in" members="oewn-snow_day-n" partOfSpeech="n" dc:subject="noun.time" dc:source="Colloquial WordNet">
                             ~~~~~~~~
These proposed ILIs are not used for translation or expanded relation traversals. In Wn, the ILIs are represented by a class with an id, a status, and a definition. For example (here, the cili project has not been loaded in Wn):
>>> import wn
>>> oewn = wn.Wordnet('oewn')
>>> oewn.synsets('velocity')[0].ili.id # an explicit ID
'i117563'
>>> oewn.synsets('velocity')[0].ili.status
'presupposed'
>>> oewn.synsets('velocity')[0].ili.definition()
>>> oewn.synsets('snow day')[0].ili.id # ili="in" is special and the ID is None in Wn
>>> oewn.synsets('snow day')[0].ili.status
'proposed'
>>> oewn.synsets('snow day')[0].ili.definition()
'a day on which school or other events are cancelled due to snow'
Note:
- The status "presupposed" means that the synset has an explicit ILI but there is no authoritative source to say whether the ILI is valid or not. The status "proposed" means that the lexicon used the special ILI "in".
- Explicit ILIs do not have ILI definitions in the lexicon, but proposed ILIs do. Note that ILI definitions are separate from synset definitions.
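The linking by shared ILI described above can be sketched without Wn itself. This is only a toy illustration, not Wn's implementation: the `<Synset>` fragments are simplified copies of the examples quoted earlier (the `dc:*` attributes and `members` are dropped so they parse without namespace declarations), and `index_by_ili` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Simplified copies of the <Synset> elements quoted in this thread.
LEXICONS = {
    "oewn": [
        '<Synset id="oewn-15307914-n" ili="i117563" partOfSpeech="n" />',
        '<Synset id="oewn-90002921-n" ili="in" partOfSpeech="n" />',
    ],
    "omw-es": [
        '<Synset id="omw-es-15282696-n" ili="i117563" partOfSpeech="n" />',
    ],
}


def index_by_ili(lexicons):
    """Group synset IDs by explicit ILI, one synset per ILI per lexicon.

    Synsets with ili="in" (proposed) or without an ili attribute are skipped,
    since they cannot link to anything in another lexicon.
    """
    index = defaultdict(dict)
    for lexicon_id, fragments in lexicons.items():
        for fragment in fragments:
            synset = ET.fromstring(fragment)
            ili = synset.get("ili")
            if not ili or ili == "in":
                continue
            index[ili][lexicon_id] = synset.get("id")
    return dict(index)


ili_index = index_by_ili(LEXICONS)
print(ili_index)
# {'i117563': {'oewn': 'oewn-15307914-n', 'omw-es': 'omw-es-15282696-n'}}
```

The "snow day" synset does not appear in the index because its ILI is only proposed.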
When the cili resource has been loaded, the presupposed statuses can change and their definitions become available:
>>> wn.download('cili')
...
>>> oewn.synsets('velocity')[0].ili.status
'active'
>>> oewn.synsets('velocity')[0].ili.definition()
'distance travelled per unit time'
The cili resource that is added here contains only a list of ILIs and their definitions (and maybe statuses in a future version: globalwordnet/cili#8), and does not contain any mappings to PWN 3.0 or 3.1 synsets.
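As a rough mental model of the status behavior shown in the sessions above, here is a toy sketch. It is not Wn's actual class: `ToyILI`, `from_lexicon`, and `resolve` are hypothetical names, and leaving an ILI as "presupposed" when the inventory does not list it is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyILI:
    id: Optional[str]
    status: str
    definition: Optional[str] = None


def from_lexicon(ili_attr, ili_definition=None):
    """Build a toy ILI from a synset's ili attribute as found in a lexicon."""
    if ili_attr == "in":
        # Proposed ILIs have no ID yet but carry their own ILI definition.
        return ToyILI(None, "proposed", ili_definition)
    # Explicit ILIs have an ID but no definition in the lexicon itself.
    return ToyILI(ili_attr, "presupposed")


def resolve(ili, inventory):
    """Look the ILI up in an inventory given as {ILI id: definition}.

    Assumption: listed ILIs become 'active'; anything else is left unchanged.
    """
    if ili.id is not None and ili.id in inventory:
        return ToyILI(ili.id, "active", inventory[ili.id])
    return ili


inventory = {"i117563": "distance travelled per unit time"}
velocity = from_lexicon("i117563")
snow_day = from_lexicon(
    "in", "a day on which school or other events are cancelled due to snow"
)

print(velocity.status, resolve(velocity, inventory).status)  # presupposed active
print(snow_day.id, resolve(snow_day, inventory).status)      # None proposed
```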
Does that help?
from wn.
Thanks, I think I see the problem, but let me make sure I got it right: there is a gap in ILI-based translation coverage when the target synset (and thus its ILI) has been merged into another. In this case, PWN 3.0 (and OMW lexicons expanded from it) have two synsets, but in PWN 3.1 and OEWN they are merged into a single synset.
"Due to the way Wn applies the ILI mappings"
There seems to be a mistaken assumption here. Wn does not use the ILI mappings that you are referring to. The only resource from https://github.com/globalwordnet/cili/ that it uses (and only if you've downloaded it) is the released CILI inventory which includes the ILI identifiers and definitions. Inter-lexicon relationships via shared ILIs are identified only by the ili attribute on <Synset> elements in WN-LMF lexicons. This attribute's value is limited to a single ILI, so there is a technical limitation that we cannot map multiple ILIs to a synset. This also follows the theoretical constraint that ILIs should be mapped to no more than one synset, and vice versa, within a lexicon.
Therefore, I disagree that there is something incorrect here in Wn, but I do recognize how things could be improved. A satisfactory solution to this issue is thus not so much a bug fix as a new feature: to store (or identify) and subsequently use changes to synset-ILI mappings across versions. This sounds appealing but I also feel like it will be hard to do correctly in a transparent fashion (e.g., when calling Synset.translate()) rather than as a discrete mapping step across lexicons. For instance, what if you translate in the other direction where the single ILI is "split" into two? Or if the translation is between two other lexicons with different changes in mappings?
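To make the gap concrete, here is a toy sketch of translation by ILI equality, which is the behavior described above. It does not use Wn: the lexicons are plain dicts from ILI to synset ID, `translate_by_ili` is a hypothetical helper, and the Finnish synset ID `omw-fi-00471613-n` is inferred from the sense IDs quoted later in this thread.

```python
# Toy lexicons as {ILI: synset ID}. The PWN 3.0-based Finnish lexicon keeps
# both ILIs; the English lexicon only has the merged synset under i37882.
OMW_FI = {"i37881": "omw-fi-00471613-n", "i37882": "omw-fi-00474568-n"}
OEWN = {"i37882": "oewn-00472688-n"}


def translate_by_ili(ili, target):
    """Return the target synsets that carry exactly the same ILI."""
    return [target[ili]] if ili in target else []


merged_away = translate_by_ili("i37881", OEWN)  # ILI no longer used in target
surviving = translate_by_ili("i37882", OEWN)    # ILI still used in target
reverse = translate_by_ili("i37882", OMW_FI)    # finds only one of the two

print(merged_away)  # []
print(surviving)    # ['oewn-00472688-n']
print(reverse)      # ['omw-fi-00474568-n']
```

Both directions lose information: the forward lookup finds nothing for the merged-away ILI, and the reverse lookup returns only one of the two source synsets.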
"At this moment, using the PWN sense keys for translation seems to be the only way to bypass the problem in Wn."
You mean to look for senses with the same sense keys across lexicons? That might work to build the merge-mapping yourself, but it wouldn't be a solution in general because senses link synsets to words and therefore non-English lexicons should have different sense keys (but more likely they do not have them at all).
Here's how you could build the mapping:
>>> import wn
>>> en30 = wn.Wordnet('omw-en')
>>> en31 = wn.Wordnet('omw-en31')
>>> en31_sensekey_ili_map = {
...     s.metadata()['identifier']: s.synset().ili
...     for s in en31.senses()
... }
>>> en30_31_ilis = {ss.ili.id: set() for ss in en30.synsets()}
>>> for s in en30.senses():
...     ili = en31_sensekey_ili_map.get(s.metadata()['identifier'])
...     if ili:
...         en30_31_ilis[s.synset().ili.id].add(ili.id)
...
>>> en30_31_ilis['i37881']
{'i37882'}
>>> en30_31_ilis['i37882']
{'i37882'}
This mapping is unidirectional, PWN 3.0 to PWN 3.1, but maybe it is useful nonetheless.
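The same join can be tried without any wordnet data installed. The sketch below runs the logic on toy data: the sense keys (`key_a`, `key_b`, `key_c`) and the ILI `i00000` are hypothetical placeholders, and `build_ili_map` is an illustrative name rather than part of Wn.

```python
# Hypothetical (sense key, ILI) pairs standing in for two lexicon versions.
OLD_SENSES = [("key_a", "i37881"), ("key_b", "i37882"), ("key_c", "i00000")]
NEW_SENSES = [("key_a", "i37882"), ("key_b", "i37882")]


def build_ili_map(old_senses, new_senses):
    """Map each old ILI to the set of new ILIs reachable via shared sense keys."""
    new_key_to_ili = dict(new_senses)
    ili_map = {}
    for key, old_ili in old_senses:
        # Every old ILI gets an entry, even when none of its keys survive.
        targets = ili_map.setdefault(old_ili, set())
        new_ili = new_key_to_ili.get(key)
        if new_ili is not None:
            targets.add(new_ili)
    return ili_map


ili_map = build_ili_map(OLD_SENSES, NEW_SENSES)
print(ili_map["i37881"])  # {'i37882'}
print(ili_map["i37882"])  # {'i37882'}
print(ili_map["i00000"])  # set()
```

Using sets as values keeps the case visible where senses of one old synset end up in several new synsets, and an empty set marks an old ILI with no counterpart.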
from wn.
Thanks @goodmami, I have corrected the formulation, since I don't want to imply that something is wrong with Wn. On the other hand, there is a problem in Wn, due to the way that the CILI mappings are applied, but I realize that this happens in OMW-data, when building the LMF databases.
I want to look more into this, and am missing a way to look up the CILI mappings from within Wn. The CILI project is installed, but I have not yet found out how to load and query it.
from wn.
Thanks @goodmami, yes your explanations help a lot indeed.
Concerning my specific problem, i.e. obtaining translations for synsets that would have one according to the CILI, but had none when querying Wn, the code you provided for mapping from en-30 ilis to en-31 ilis indeed solves the problem for en-31. With oewn, a detour is necessary, since it has sensekeys encoded as sense.id, but it works equally well:
import wn
#---------------------------------------------------------------------
# adapted from english-wordnet/scripts/wordnet_yaml.py, by @jmccrae:
def unmap_sense_key(sk):
    e = sk.split("__")
    l = e[0][5:]
    r = "__".join(e[1:])
    return (l.replace("-ap-", "'").replace("-sl-", "/").replace("-ex-", "!").replace("-cm-", ",").replace("-cl-", ":") +
            "%" + r.replace(".", ":").replace("-sp-", "_"))
#---------------------------------------------------------------------
def sense2key(sense, wnid="omw-en"):
    if wnid == 'oewn':
        return unmap_sense_key(sense.id)
    else:
        return sense.metadata()['identifier']

def map30(target):
    wnet = wn.Wordnet(target)
    wnid = wnet.lexicons()[0].id
    sk_ili = {sense2key(se, wnid): se.synset().ili for se in wnet.senses()}
    ilimap30 = {}
    for se in wn.Wordnet("omw-en").senses():
        ili = sk_ili.get(se.metadata()['identifier'])
        if ili and ili.status != "proposed":
            ilimap30[se.synset().ili.id] = ili.id
    return ilimap30
#---------------------------------------------------------------------
#target = "omw-en31"
target = "oewn"
ilimap = map30(target)
i1 = "i37881"
i2 = "i37882"
print(ilimap[i1])
i37882
print(ilimap[i2])
i37882
wnfi = wn.Wordnet("omw-fi")
wn2 = wn.Wordnet(target)
ss1 = wnfi.synsets(ili = i1)[0]
print(f"{ss1.ili.id}, {ss1.senses()},\n\
{ss1.translate(target)}, {wn2.synsets(ili=ilimap[i1])[0]}")
Now, the mapping can provide a translation for this Finnish synset, which has none using Wn's translate() function.
i37881, [Sense('omw-fi-baseball-00471613-n'), Sense('omw-fi-baseball--peli-00471613-n')],
[], Synset('oewn-00472688-n')
So in Wn at present, we have to go through sense-key mappings in order to avoid this problem. I suppose there could be a more direct way to use the CILI mappings, without necessarily losing synsets in the translation, since CILI contains information about the merged synsets. But even then, it remains to be seen whether ILI mappings can match the performance of sense-key mappings.
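The sense-ID conversion used in the script above can be checked in isolation. The sketch below restates `unmap_sense_key` with more descriptive variable names (the behavior is meant to be identical). The two sample IDs are illustrative and assume the `oewn-<lemma>__<rest>` pattern that the function implies.

```python
def unmap_sense_key(sense_id):
    """Convert an OEWN-style sense ID back into a PWN-style sense key."""
    parts = sense_id.split("__")
    lemma = parts[0][5:]         # drop the 5-character "oewn-" prefix
    rest = "__".join(parts[1:])  # everything after the first "__"
    lemma = (
        lemma.replace("-ap-", "'")
        .replace("-sl-", "/")
        .replace("-ex-", "!")
        .replace("-cm-", ",")
        .replace("-cl-", ":")
    )
    return lemma + "%" + rest.replace(".", ":").replace("-sp-", "_")


# Illustrative sense IDs (assumed format, not read from a lexicon).
plain = unmap_sense_key("oewn-baseball__1.04.00..")
escaped = unmap_sense_key("oewn--ap-hood__1.15.00..")

print(plain)    # baseball%1:04:00::
print(escaped)  # 'hood%1:15:00::
```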
from wn.
As @goodmami wrote:
"what if you translate in the other direction where the single ILI is "split" into two?"
Yes, the inverse problem is that currently, when translating in the opposite direction, Wn only returns one of the merged synsets:
i2 = "i37882"
print(wn.Wordnet("oewn").synsets(ili = i2)[0].translate("omw-fi"))
[Synset('omw-fi-00474568-n')]
In that case, the complete translation would be the union of the senses belonging to all the synsets obtained by reversing the ilimap from above:
def rev_dict(dic):
    rdic = {}
    for key, val in dic.items():
        if val not in rdic:
            rdic[val] = {key}
        else:
            rdic[val].add(key)
    return rdic
sources = rev_dict(ilimap)[i2]
print(f"{sources} --> {i2}")
{'i37881', 'i37882'} --> i37882
print([wn.Wordnet("omw-fi").synsets(ili = i)[0].senses() for i in sources])
[[Sense('omw-fi-baseball-00471613-n'), Sense('omw-fi-baseball--peli-00471613-n')], [Sense('omw-fi-baseball-00474568-n')]]
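For readers without the lexicons installed, the reverse lookup can be sketched on toy data. This is only an illustration: the sense IDs are copied from the output above as plain strings, and `reverse_map` and `back_translate` are hypothetical helper names, not Wn functions.

```python
from collections import defaultdict

# Old ILI -> new ILI, as produced by the sense-key mapping above.
ilimap = {"i37881": "i37882", "i37882": "i37882"}

# Finnish senses per (old) ILI, copied from the output above.
FI_SENSES = {
    "i37881": ["omw-fi-baseball-00471613-n", "omw-fi-baseball--peli-00471613-n"],
    "i37882": ["omw-fi-baseball-00474568-n"],
}


def reverse_map(mapping):
    """Invert {old: new} into {new: {old, ...}}."""
    reversed_map = defaultdict(set)
    for old, new in mapping.items():
        reversed_map[new].add(old)
    return dict(reversed_map)


def back_translate(ili, mapping, senses_by_ili):
    """Union the senses of every old ILI that was merged into `ili`."""
    sources = reverse_map(mapping).get(ili, {ili})
    senses = []
    for source in sorted(sources):  # sorted for a stable output order
        senses.extend(senses_by_ili.get(source, []))
    return senses


all_senses = back_translate("i37882", ilimap, FI_SENSES)
print(all_senses)
```

The result contains all three Finnish senses, where the plain ILI lookup only reaches the synset that still carries i37882.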
from wn.
@ekaf Can you remind me what is the expected fix here? Currently I'm leaning toward saying this is a data challenge (best solved with documentation) and not a bug or missing feature in the code, but maybe you have something in mind that would be appropriate for this library.
from wn.
There are relatively few (around 30) merged synsets between each English Wordnet version, so losing 30 synsets in translation may not seem a huge problem. However, it is not solved with documentation alone, and a solution in the library appears more helpful.
Since version 3.6.6 (see nltk/#2889), NLTK's wordnet.py library produces a sense-key based mapping "on the fly", at load time, preventing this problem from ever occurring. A similar approach can work in Wn, using code like in the comment above.
An alternative could be if the ILI project also produces lists of merged synsets, with one (or more) synset(s) deprecated and linked to a target synset. This approach is less versatile, because each future English Wordnet needs a separate list of deprecations: you would have to wait for such lists to be produced, then rely on their adequacy, and still need additional code to interpret the deprecations in Wn.
from wn.
@ekaf thank you for explaining. I'm not entirely sold on this solution because it encodes lexicon-specific information (the sense keys and where they are stored), which are really only relevant for the English wordnets, and I strive as much as possible for Wn to not favor any particular wordnet or language (with the exception of the included Morphy lemmatizer).
That said, so many wordnets are based on the English structure that it might make sense for practicality to beat purity here. The ILI solution would be more "pure", but, as you describe, that approach has other issues.
@fcbond, I'd like to get your perspective. Should Wn codify English-specific workarounds for merged synsets across wordnet versions? Or maybe the problem is rare enough that some documentation of the problem with a recipe for getting around it would suffice?
from wn.