I tried running the prepare-conll-coref no file is ge

prepare-conll-coref does not convert AIDA-YAGO2-dataset about neleval HOT 5 OPEN

wikilinks commented on August 29, 2024

prepare-conll-coref does not convert AIDA-YAGO2-dataset

from neleval.

Comments (5)

jnothman commented on August 29, 2024

CoNLL coref is not the same as CoNLL-AIDA. I have clarified this in the docs. Please remind me what the AIDA dataset looks like?

userofgithub1 commented on August 29, 2024

Oops, sorry, I missed that in the docs. Here is the format of CoNLL-AIDA:

-DOCSTART- (1 EU)
EU	B	EU	--NME--
rejects
German	B	German	Germany	http://en.wikipedia.org/wiki/Germany	/m/0345h
call
to
boycott
British	B	British	United_Kingdom	http://en.wikipedia.org/wiki/United_Kingdom	/m/07ssc
lamb
.

Peter	B	Peter Blackburn	--NME--
Blackburn	I	Peter Blackburn	--NME--

BRUSSELS	B	BRUSSELS	Brussels	http://en.wikipedia.org/wiki/Brussels	/m/0177z
1996-08-22

The
European	B	European Commission	European_Commission	http://en.wikipedia.org/wiki/European_Commission	/m/02q9k
Commission	I	European Commission	European_Commission	http://en.wikipedia.org/wiki/European_Commission	/m/02q9k
said
on
Thursday
it
disagreed

And this is the format of the system output which I believe is accepted by $ neleval evaluate:

1164testb RUGBY	1474	1491	en.wikipedia.org/wiki/Andrea_Castellani	1.0	PERSON
1164testb RUGBY	1452	1471	en.wikipedia.org/wiki/Alessandro_Moscardi	1.0	PERSON
1164testb RUGBY	1433	1449	en.wikipedia.org/wiki/Nicola_Mazzucato	1.0	ORG
1164testb RUGBY	1416	1430	en.wikipedia.org/wiki/Gianluca_Guidi	1.0	PERSON

Thank you so much,

jnothman commented on August 29, 2024

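def aida_to_neleval(f, iob_col=1, kbid_col=3):
    # Tab-separated columns of a mention row, as in the sample above:
    # 0 token, 1 B/I tag, 2 mention text, 3 KB id, 4 Wikipedia URL, 5 Freebase id
    def emit(end):
        # Print the pending mention as docid, start, end (token offsets, end exclusive)
        if 'start' not in cur:
            return
        kbid = cur['kbid']
        if kbid == '--NME--':
            kbid = 'NIL0000000'
        print(docid, cur['start'], end, kbid, sep='\t')
        del cur['start']
        del cur['kbid']

    cur = {}
    docid = None
    offset = 0
    for l in f:
        l = l.rstrip()
        if not l:
            continue
        elif l.startswith('-DOCSTART-'):
            # Flush a mention that ran up to the last token of the previous document
            emit(offset + 1)
            docid = l[len('-DOCSTART-'):].strip().replace(' ', '_')
            offset = 0
        else:
            offset += 1
            cols = l.split('\t')
            if len(cols) == 1:
                # A plain token closes any open mention
                emit(offset)
                continue
            if cols[iob_col] == 'B' or cols[kbid_col] != cur.get('kbid'):
                emit(offset)
                cur['start'] = offset
                cur['kbid'] = cols[kbid_col]
    # Flush a mention that ran up to the last token of the input
    emit(offset + 1)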


if __name__ == '__main__':
    import io

    f = io.StringIO('''
-DOCSTART- (1 EU)
EU\tB\tEU\t--NME--
rejects
German\tB\tGerman\tGermany\thttp://en.wikipedia.org/wiki/Germany\t/m/0345h
call
to
boycott
British\tB\tBritish\tUnited_Kingdom\thttp://en.wikipedia.org/wiki/United_Kingdom\t/m/07ssc
lamb
.

Peter\tB\tPeter Blackburn\t--NME--
Blackburn\tI\tPeter Blackburn\t--NME--

BRUSSELS\tB\tBRUSSELS\tBrussels\thttp://en.wikipedia.org/wiki/Brussels\t/m/0177z
1996-08-22

The
European\tB\tEuropean Commission\tEuropean_Commission\thttp://en.wikipedia.org/wiki/European_Commission\t/m/02q9k
Commission\tI\tEuropean Commission\tEuropean_Commission\thttp://en.wikipedia.org/wiki/European_Commission\t/m/02q9k
said
on
Thursday
it
disagreed
    ''')
    aida_to_neleval(f)

outputs

(1_EU)	1	2	NIL0000000
(1_EU)	3	4	Germany
(1_EU)	7	8	United_Kingdom
(1_EU)	10	12	NIL0000000
(1_EU)	12	13	Brussels
(1_EU)	15	17	European_Commission

I'll look into making a new command out of it.

userofgithub1 commented on August 29, 2024

Thank you so much. Sorry for the late reply; I've been super busy with other tasks. I will test your code as soon as I get back to this task.

Thanks again :)

userofgithub1 commented on August 29, 2024

Awesome. I just had to decode before splitting to resolve UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 6: ordinal not in range(128), so I changed cols = l.split('\t') to cols = l.decode('utf-8').split('\t').
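An alternative to decoding each line, sketched under the assumption that the input is a UTF-8 file on disk: open it with io.open and an explicit encoding, so every line is already text by the time it is split. The helper name read_rows and the throwaway demo file are hypothetical, not part of neleval.

```python
import io
import os
import tempfile


def read_rows(path):
    """Yield the tab-separated columns of each non-empty line (hypothetical helper)."""
    # io.open decodes while reading, so no per-line .decode('utf-8') is needed.
    with io.open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip()
            if line:
                yield line.split('\t')


# Demo on a temporary file whose first token is non-ASCII
# (its UTF-8 encoding starts with byte 0xc5).
with tempfile.NamedTemporaryFile(suffix='.tsv', delete=False) as tmp:
    tmp.write('Łódź\tB\tŁódź\t--NME--\n'.encode('utf-8'))

rows = list(read_rows(tmp.name))
os.remove(tmp.name)
print(rows)
```

The file object returned this way could be passed straight to aida_to_neleval, since that function only iterates over lines.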

I forgot to mention that in some rows the kbid contains escaped characters, for example People\u0027s_Republic_of_China where it should be People's_Republic_of_China. Complete rows look like this:

.

But
Le	B	Le Matin	Le_Matin_\u0028France\u0029	http://en.wikipedia.org/wiki/Le_Matin_(France)	/m/03nrccn
Matin	I	Le Matin	Le_Matin_\u0028France\u0029	http://en.wikipedia.org/wiki/Le_Matin_(France)	/m/03nrccn
newspaper
,
quoting
witnesses

How can I fix the format if I've decoded in UTF-8 when splitting?
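One possible fix, offered as a sketch rather than anything neleval provides: once the line is text, replace each literal \uXXXX sequence in the KB id with the character it encodes. The helper name unescape_kbid is hypothetical, and the sketch assumes the escapes are always a backslash, a "u" and exactly four hex digits, as in the rows above. It is written for Python 3; on Python 2, unichr would take the place of chr.

```python
import re

# Matches a literal backslash, "u", and four hex digits, e.g. \u0027
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def unescape_kbid(kbid):
    """Replace literal backslash-u escapes with the characters they encode."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), kbid)


# Raw strings keep the backslashes literal, as they appear in the data file.
print(unescape_kbid(r'People\u0027s_Republic_of_China'))  # People's_Republic_of_China
print(unescape_kbid(r'Le_Matin_\u0028France\u0029'))      # Le_Matin_(France)
```

In the converter, this would be applied to cols[kbid_col] before the value is compared or stored, so that both spellings of the same entity end up identical.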

Also, could you take a look at this related issue, andychisholm/nel#21? It's in Conll.py: the offsets are calculated completely wrong. I tried many different ways, and also tried to apply your method there, but it still doesn't calculate them correctly. Many many thanks :)
