najoungkim / cogs
License: MIT License
Thanks very much for your work, @najoungkim, and I really like this evaluation. I have some questions about your dataset.
I noticed that the hedgehog is used for the subj_to_obj_common task. In train.tsv you have one exposure example with hedgehog as subject, but in train_100.tsv the entire corpus does not contain the word hedgehog at all. Shouldn't train_100.tsv contain 100 examples of hedgehog as subject, according to your description?
I am quite confused now, and not sure what the exact process of adding 100 exposure examples is. How can a model generalize to hedgehog in the gen set when it never occurs in train_100.tsv? (It seems that some sentences were removed from train.tsv? The exposure_example_subj_common examples no longer exist in train_100...)
(The same occurs in obj_to_subj_common: cockroach occurs in object position once, and only once in the whole corpus, in train.tsv, and not at all in train_100.tsv, while cockroach is used to test the obj_to_subj_common task.)
Thanks very much for any response.
TL;DR: it can be hard to generalize to sentences containing words never seen during training. Is this part of your definition of what it means to compositionally generalize?
One way to interpret the Principle of Compositionality would be that knowing the meaning of all words (let's ignore problems of disambiguation and idioms for now) is a necessary prerequisite to be able to 'understand' (more precisely, to compute the meaning of) a sentence. In other words, no out-of-vocabulary (OOV) words are allowed.
Imagine you are asked to provide the meaning of a sentence (e.g. as logical form) containing a word that you've never encountered before: would you be able to do so?
The COGS logical forms make it tempting to say yes, at least for nouns (only singular nouns appear here), because their morphological form on the input side is character-wise identical to a token of the logical form (e.g. 'The boy' translates to `* boy ( x _ 1 ) ;`). But do we want to rely on that cheap copying trick?
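To make the copying trick concrete, here is a toy sketch (my own illustration, not the COGS generation code; the function name is hypothetical) of why an OOV noun poses no problem for a model that can copy input tokens into the output:

```python
# Toy sketch of the "cheap copying trick": a noun's logical-form
# predicate is character-identical to its surface token, so even an
# out-of-vocabulary noun can be emitted by copying the input token.
# (Illustration only; `noun_predicate` is a hypothetical helper.)

def noun_predicate(token: str, index: int, definite: bool) -> str:
    """Build a COGS-style predicate for a singular noun token."""
    pred = f"{token} ( x _ {index} )"
    return f"* {pred}" if definite else pred

# 'The boy' -> '* boy ( x _ 1 )', as in the example above.
print(noun_predicate("boy", 1, definite=True))

# An unseen noun works the same way: nothing about 'monastery'
# needs to have been in the training vocabulary to copy it.
print(noun_predicate("monastery", 4, definite=True))
```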
I can see some justification for the case of proper nouns ('names'), but not really for common nouns like 'gardner' (sic!) or 'monastery'.
No matter whether your definition of compositional generalization includes dealing with OOV words or not, I would rather have it made explicit; that's why I am raising the issue here.
I actually only stumbled upon this because a couple of sentences never succeeded under the exact-match criterion across all the different approaches I tried, even preventing 100% dev set accuracy no matter how long I trained (it always stayed at 99.97% due to that one 'gardner' sentence, see below).
Using the commit version from April 2021 (6f66383), I got the following numbers (e.g. with `grep -c word ./data/*.tsv`, which counts lines (= samples), not word occurrences):
| word | train.tsv | train_100.tsv | dev.tsv | test.tsv | gen.tsv |
|---|---|---|---|---|---|
| monastery | 0 | 1 | 0 | 0 | 12 |
| gardner | 0 | 0 | 1 | 1 | 10 |
| total | 0 | 1 | 1 | 1 | 22 |
The last number was obtained with `grep -c 'monastery\|gardner' gen.tsv`.
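The same line-counting logic (count lines containing a word, as `grep -c` does, rather than total word occurrences) can be sketched in Python, with toy data standing in for the actual `.tsv` files:

```python
def count_lines_containing(lines, *words):
    """Count lines containing at least one of the given words,
    mirroring grep -c: matching lines, not word occurrences."""
    return sum(1 for line in lines if any(w in line for w in words))

# Toy stand-in for gen.tsv rows (sentence <TAB> logical form).
toy_gen = [
    "The gardner saw the boy .\t...",
    "A hero liked the monastery .\t...",
    "Emma ate the cake .\t...",
]

print(count_lines_containing(toy_gen, "monastery"))             # 1
print(count_lines_containing(toy_gen, "monastery", "gardner"))  # 2
```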
In the generalization set, the PP recursion generalization type (`pp_recursion`) seems to be affected most (6 samples with 'monastery', 9 with 'gardner'; non-overlapping samples).
As a consequence, a model which builds its vocabulary from its training set only will struggle with 1 sentence each on dev and test, and with 22 or 10 samples on the gen set (depending on whether it was trained on `train.tsv` or `train_100.tsv`).
If the dev set is included in the vocabulary, the problem of OOV words in the gen set (12 'monastery' samples) remains to some degree, at least for training on `train.tsv`.
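To make the "vocabulary built from the training set only" scenario concrete, here is a toy sketch (my own illustration with invented sentences, not the actual COGS data) of how OOV tokens in the gen set fall out:

```python
# Toy sketch: a model whose vocabulary comes only from the training
# split sees some gen-set tokens as OOV, analogous to 'gardner' and
# 'monastery' in the real splits. (Sentences invented for illustration.)

def vocab(sentences):
    """Whitespace-tokenized vocabulary of a list of sentences."""
    return {tok for s in sentences for tok in s.split()}

train = ["The boy smiled .", "Emma liked a cake ."]
gen = ["The gardner saw a monastery .", "The boy ran ."]

oov = sorted(vocab(gen) - vocab(train))
print(oov)  # ['gardner', 'monastery', 'ran', 'saw']
```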
Long story short, my question is: do you require models to deal with OOV words in order to solve COGS' generalization set and succeed at compositional generalization?
I've read your EMNLP paper which introduced the COGS dataset, but haven't found any comment on this. I would be very glad if you could point me to it in case I missed it.
Thank you in advance!