Comments (5)

jnothman commented on August 29, 2024

CoNLL coref is not the same as CoNLL-AIDA. I have clarified this in the docs. Please remind me what the AIDA dataset looks like?

from neleval.

userofgithub1 commented on August 29, 2024

Oops, sorry, I missed that in the docs. Here is the format of CoNLL-AIDA:

-DOCSTART- (1 EU)
EU	B	EU	--NME--
rejects
German	B	German	Germany	http://en.wikipedia.org/wiki/Germany	/m/0345h
call
to
boycott
British	B	British	United_Kingdom	http://en.wikipedia.org/wiki/United_Kingdom	/m/07ssc
lamb
.

Peter	B	Peter Blackburn	--NME--
Blackburn	I	Peter Blackburn	--NME--

BRUSSELS	B	BRUSSELS	Brussels	http://en.wikipedia.org/wiki/Brussels	/m/0177z
1996-08-22

The
European	B	European Commission	European_Commission	http://en.wikipedia.org/wiki/European_Commission	/m/02q9k
Commission	I	European Commission	European_Commission	http://en.wikipedia.org/wiki/European_Commission	/m/02q9k
said
on
Thursday
it
disagreed
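
To make the column layout of the rows above explicit, here is a minimal sketch that splits one row into named fields. The field names (`token`, `iob`, `mention`, `kbid`, `url`, `freebase_id`) are illustrative labels inferred from the sample, not names taken from the dataset documentation, and `parse_aida_row` is a hypothetical helper.

```python
from collections import namedtuple

# Field names are illustrative labels inferred from the sample rows.
AidaRow = namedtuple('AidaRow', 'token iob mention kbid url freebase_id')


def parse_aida_row(line):
    """Split one tab-separated row; columns that are absent become None."""
    cols = line.rstrip('\n').split('\t')
    cols += [None] * (6 - len(cols))
    return AidaRow(*cols[:6])


row = parse_aida_row(
    'German\tB\tGerman\tGermany\thttp://en.wikipedia.org/wiki/Germany\t/m/0345h')
print(row.iob, row.kbid)   # B Germany

plain = parse_aida_row('rejects')
print(plain.iob)           # None
```

With 0-based indexing, the B/I tag sits in column 1 and the KB id (or --NME--) in column 3.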

And this is the format of the system output, which I believe is accepted by neleval evaluate:

1164testb RUGBY	1474	1491	en.wikipedia.org/wiki/Andrea_Castellani	1.0	PERSON
1164testb RUGBY	1452	1471	en.wikipedia.org/wiki/Alessandro_Moscardi	1.0	PERSON
1164testb RUGBY	1433	1449	en.wikipedia.org/wiki/Nicola_Mazzucato	1.0	ORG
1164testb RUGBY	1416	1430	en.wikipedia.org/wiki/Gianluca_Guidi	1.0	PERSON
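
As a rough illustration of how these rows break down, the sketch below parses one line into typed fields. It assumes the six tab-separated fields are document id, start offset, end offset, KB link, score and entity type. That is one reading of the rows above and is not confirmed in this thread. `Annotation` and `parse_annotation` are hypothetical names.

```python
from collections import namedtuple

# Assumed meaning of the six tab-separated fields in a system output row.
Annotation = namedtuple('Annotation', 'docid start end kbid score type')


def parse_annotation(line):
    """Parse one system output row into an Annotation with typed fields."""
    docid, start, end, kbid, score, etype = line.rstrip('\n').split('\t')
    return Annotation(docid, int(start), int(end), kbid, float(score), etype)


ann = parse_annotation(
    '1164testb RUGBY\t1474\t1491\ten.wikipedia.org/wiki/Andrea_Castellani\t1.0\tPERSON')
print(ann.docid)            # 1164testb RUGBY
print(ann.end - ann.start)  # 17
```

Note that the document id here contains a space, so splitting on tabs rather than on arbitrary whitespace matters.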

Thank you so much,

jnothman commented on August 29, 2024

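def aida_to_neleval(f, iob_col=1, kbid_col=3):
    """Print one tab-separated line per mention: docid, start, end, kbid.

    Offsets are 1-based token indices within a document; end is exclusive.
    Columns are 0-based: column 1 holds the B/I tag, column 3 the KB id.
    """
    def emit(end):
        # Flush the mention being built, if any.
        if 'start' not in cur:
            return
        kbid = cur['kbid']
        if kbid == '--NME--':
            kbid = 'NIL0000000'
        print(docid, cur['start'], end, kbid, sep='\t')
        del cur['start']
        del cur['kbid']

    cur = {}
    offset = 0
    for l in f:
        l = l.rstrip()
        if not l:
            continue
        elif l.startswith('-DOCSTART-'):
            # A mention still open here ended on the previous document's last token.
            emit(offset + 1)
            docid = l[len('-DOCSTART-'):].strip().replace(' ', '_')
            offset = 0
        else:
            offset += 1
            cols = l.split('\t')
            if len(cols) == 1:
                emit(offset)
                continue
            if cols[iob_col] == 'B' or cols[kbid_col] != cur.get('kbid'):
                emit(offset)
                cur['start'] = offset
                cur['kbid'] = cols[kbid_col]
    # Flush a mention that runs to the end of the input.
    emit(offset + 1)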


if __name__ == '__main__':
    import io

    f = io.StringIO('''
-DOCSTART- (1 EU)
EU\tB\tEU\t--NME--
rejects
German\tB\tGerman\tGermany\thttp://en.wikipedia.org/wiki/Germany\t/m/0345h
call
to
boycott
British\tB\tBritish\tUnited_Kingdom\thttp://en.wikipedia.org/wiki/United_Kingdom\t/m/07ssc
lamb
.

Peter\tB\tPeter Blackburn\t--NME--
Blackburn\tI\tPeter Blackburn\t--NME--

BRUSSELS\tB\tBRUSSELS\tBrussels\thttp://en.wikipedia.org/wiki/Brussels\t/m/0177z
1996-08-22

The
European\tB\tEuropean Commission\tEuropean_Commission\thttp://en.wikipedia.org/wiki/European_Commission\t/m/02q9k
Commission\tI\tEuropean Commission\tEuropean_Commission\thttp://en.wikipedia.org/wiki/European_Commission\t/m/02q9k
said
on
Thursday
it
disagreed
    ''')
    aida_to_neleval(f)

outputs

(1_EU)	1	2	NIL0000000
(1_EU)	3	4	Germany
(1_EU)	7	8	United_Kingdom
(1_EU)	10	12	NIL0000000
(1_EU)	12	13	Brussels
(1_EU)	15	17	European_Commission
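
The offsets in this output count tokens, not characters: non-blank lines after -DOCSTART- are numbered from 1, and the end offset is exclusive. A small sketch of how a span maps back to tokens (`mention_tokens` is a hypothetical helper):

```python
def mention_tokens(tokens, start, end):
    """Return the tokens covered by a 1-based, end-exclusive token span."""
    return tokens[start - 1:end - 1]


# Non-blank token lines of the sample document, in order.
tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British',
          'lamb', '.', 'Peter', 'Blackburn', 'BRUSSELS', '1996-08-22',
          'The', 'European', 'Commission', 'said']

print(mention_tokens(tokens, 10, 12))  # ['Peter', 'Blackburn']
print(mention_tokens(tokens, 15, 17))  # ['European', 'Commission']
```

The system output rows quoted earlier use much larger numbers (1474, 1491), which look like character offsets. If so, gold and system files would need to be brought to the same offset scheme before their spans can match.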

I'll look into making a new command out of it.

userofgithub1 commented on August 29, 2024

Thank you so much. Sorry for the late reply; I've been super busy with other tasks. I will test your code as soon as I get back to this task.

Thanks again :)

userofgithub1 commented on August 29, 2024

Awesome. I just had to decode before splitting to resolve this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 6: ordinal not in range(128)

So I changed cols = l.split('\t') to cols = l.decode('utf-8').split('\t').
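
An alternative to decoding each line by hand is to open the file with an explicit encoding, so that every line is already text when it is split. The sketch below uses `io.open`, which accepts `encoding` on both Python 2 and Python 3. The file name and its contents are made up for the demonstration.

```python
import io
import os
import tempfile

# Write a small UTF-8 sample; the first character encodes to bytes 0xC5 0x81.
path = os.path.join(tempfile.mkdtemp(), 'sample.tsv')
with io.open(path, 'w', encoding='utf-8') as out:
    out.write(u'\u0141\u00f3d\u017a\tB\t\u0141\u00f3d\u017a\t--NME--\n')

# Reading with an explicit encoding yields text lines, so no .decode() is needed.
with io.open(path, encoding='utf-8') as f:
    for line in f:
        cols = line.rstrip().split(u'\t')
        print(len(cols))  # 4
```

A file object opened this way could be passed to aida_to_neleval in place of the StringIO sample.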

I forgot to mention that in some rows the kbid has a mixed format with escaped characters, for example People\u0027s_Republic_of_China where it should be People's_Republic_of_China. Complete rows look like this:

.

But
Le	B	Le Matin	Le_Matin_\u0028France\u0029	http://en.wikipedia.org/wiki/Le_Matin_(France)	/m/03nrccn
Matin	I	Le Matin	Le_Matin_\u0028France\u0029	http://en.wikipedia.org/wiki/Le_Matin_(France)	/m/03nrccn
newspaper
,
quoting
witnesses

How can I fix the format if I've decoded in UTF-8 when splitting?
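
One possible way to normalize such ids, sketched for Python 3: the `\u0027` sequences are literal backslash-u escapes in the text, so decoding the file as UTF-8 leaves them untouched, and they need a separate replacement step. `unescape_kbid` is a hypothetical helper, not part of neleval.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def unescape_kbid(kbid):
    """Replace literal \\uXXXX sequences with the characters they name."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), kbid)


print(unescape_kbid(r'People\u0027s_Republic_of_China'))  # People's_Republic_of_China
print(unescape_kbid(r'Le_Matin_\u0028France\u0029'))      # Le_Matin_(France)
```

The regular expression only touches `\uXXXX` sequences, so characters that are already decoded (including non-ASCII ones) pass through unchanged. On Python 2, `unichr` would be needed in place of `chr`.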

Also, could you take a look at the related issue andychisholm/nel#21? It's in Conll.py: the offsets are calculated completely wrong. I tried many different ways, and also tried to apply your method there, but it still doesn't calculate them correctly. Many, many thanks :)
