wikilinks / neleval
Entity disambiguation evaluation and error analysis tool
License: Apache License 2.0
A user on the TAC mailing list wished to evaluate a mix of 2014 and pre-2014 EL tasks, such that precision should indicate precision of linking/clustering, ignoring spurious mentions. While it is not hard to remove spurious mentions using grep, this may be worth facilitating in one of the following ways:
- an --ignore-spurious flag to evaluate, significance and confidence
- an .is_aligned attribute on Annotations that is True for all gold annotations and, when loading annotations from a system output, is set with respect to some gold standard
The .is_aligned approach would appear to be most flexible and in line with the current design.
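A minimal sketch of the .is_aligned idea, assuming annotations are keyed by (docid, start, end) spans; the Annotation fields and the mark_alignment helper here are hypothetical placeholders, not neleval's actual API:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """Stand-in for neleval's Annotation; fields are illustrative."""
    docid: str
    start: int
    end: int
    kbid: str
    is_aligned: bool = True  # True for all gold annotations

def mark_alignment(system, gold):
    """Set is_aligned on system annotations with respect to a gold standard.

    Spurious mentions (spans absent from gold) get is_aligned=False and
    can be dropped before computing linking/clustering precision.
    """
    gold_spans = {(a.docid, a.start, a.end) for a in gold}
    for a in system:
        a.is_aligned = (a.docid, a.start, a.end) in gold_spans
    return [a for a in system if a.is_aligned]
```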
Start by pulling the broken tests out of neleval/test.py into somewhere they actually get run. In an ideal world, we would have designed it this way from the start...
Should be able to evaluate (the micro-average over documents of) within-document coreference resolution performance. With the current implementation the following approaches exist:
- treat the entire corpus as a single clustering problem in evaluate
- use the reference CoNLL scorer (via prepare-conll-coref)
Note that the former approach breaks for the pairwise_negative aggregate, as true negatives from across the corpus will be counted.
My current preferred solution is to add an option to evaluate specifying which fields to break the calculation down by: ordinarily 'doc', but perhaps 'type' would also be of interest. evaluate would then calculate all measures over each group, then add results for the micro-average and macro-average. This would also mean we can rename the sets-micro aggregate to sets.
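A rough sketch of that breakdown, assuming each measure can report (tp, fp, fn) counts per group; score_group and the field accessor are hypothetical placeholders, not the current evaluate internals:

```python
from collections import defaultdict

def grouped_scores(gold, system, score_group, key=lambda a: a.docid):
    """Score each group (ordinarily a document) separately, then average."""
    gold_by, sys_by = defaultdict(list), defaultdict(list)
    for a in gold:
        gold_by[key(a)].append(a)
    for a in system:
        sys_by[key(a)].append(a)

    def prf(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # per-group (tp, fp, fn) counts from whichever measure is being computed
    counts = [score_group(gold_by[k], sys_by[k])
              for k in sorted(set(gold_by) | set(sys_by))]
    micro = prf(*map(sum, zip(*counts)))  # pool counts, then compute P/R/F
    per_group = [prf(*c) for c in counts]
    macro = [sum(col) / len(col) for col in zip(*per_group)]  # mean P/R/F
    return micro, macro
```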
Thanks for expressing the need for this, @shyamupa
I've had output where the first report line is strong_all_match, and other output where it is strong_nil_match. The numbers are the same, but the reordered lines are a bit confusing; a stable order would be great.
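Presumably this just needs the report rows sorted by a canonical list of measure names before printing; a minimal sketch, with the row layout assumed to be (measure, scores...):

```python
MEASURE_ORDER = ['strong_link_match', 'strong_nil_match',
                 'strong_all_match', 'strong_typed_link_match']
RANK = {m: i for i, m in enumerate(MEASURE_ORDER)}

def stable_rows(rows):
    """Sort (measure, scores) rows canonically; unknown measures go last."""
    return sorted(rows, key=lambda row: (RANK.get(row[0], len(RANK)), row[0]))
```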
I believe there is a 2 missing in the definition of the Dice coefficient, which causes the true positives and false positives to be off by a factor of 2. See line 343 of https://github.com/wikilinks/neleval/blob/master/neleval/coref_metrics.py.
Luckily this doesn't affect precision, recall or F-score, since everything ends up off by the same factor of two, so the issue is not urgent, but it should be an easy fix.
My understanding of CEAF is based on what I read in http://www.aclweb.org/anthology/H05-1004; the similarity metrics are defined on page 28. It's also possible I've misunderstood how this metric is calculated; if that's the case, kindly let me know.
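For reference, the Dice coefficient between two entity sets A and B (the φ4 similarity in Luo 2005) is 2|A ∩ B| / (|A| + |B|); a minimal standalone version, not the code at the line cited above:

```python
def dice(a, b):
    """Dice coefficient between two sets of mentions.

    The factor of 2 in the numerator is what makes dice(x, x) == 1.0.
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))
```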
Allows us to do weighted coref scores.
Should these be nel?
Hi there, this is my mistake. Sorry about this issue.
Thank you for your work!
Arda
I want to cite the tool. Which paper should I cite? The TAC KBP overviews, or the ACL paper "Cheap and easy entity evaluation"?
We could save on so many scripts and their debugging with a bit of pandas.
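For instance, filtering and aggregating an evaluation report could become a few lines; this sketch assumes the report is tab-separated with columns including measure, precision, recall and fscore (the column names are my assumption):

```python
import pandas as pd

# Load a tab-separated evaluation report (column names assumed).
report = pd.read_csv('evaluation.tsv', sep='\t')

# Keep only the strong mention-level measures, in a stable order.
strong = report[report['measure'].str.startswith('strong_')]
print(strong.sort_values('measure')[['measure', 'precision', 'recall', 'fscore']])
```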
We are a 2016 KBP team and have met a problem: the results obtained with this procedure are not the same as the official TAC results.
sudo sh ./scripts/run_tac15_evaluation.sh /home/cb/eval/mentions_gold_name.xml /home/cb/eval/tac_kbp_2015_tedl_evaluation_gold_standard_entity_mentions.tab /home/cb/output /home/cb/evalre
I don't know what the full set of arguments is. Where can I find a readable introduction to the arguments? This is the message that appears when I try to run the scripts:
INFO Converting gold to evaluation format..
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/neleval/__main__.py", line 91, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/neleval/__main__.py", line 86, in main
    result = obj()
  File "/usr/local/lib/python2.7/dist-packages/neleval/tac.py", line 193, in __call__
    return u'\n'.join(unicode(a) for a in self.read_annotations(self.system))
  File "/usr/local/lib/python2.7/dist-packages/neleval/tac.py", line 193, in <genexpr>
    return u'\n'.join(unicode(a) for a in self.read_annotations(self.system))
  File "/usr/local/lib/python2.7/dist-packages/neleval/tac.py", line 212, in read_annotations
    key=key_fn), key_fn)
IndexError: list index out of range
INFO converting systems to evaluation format..
ls: cannot access /home/cb/eval/tac_kbp_2015_tedl_evaluation_gold_standard_entity_mentions.tab/*: Not a directory
xargs: invalid number for -P option
Usage: xargs [-0prtx] [--interactive] [--null] [-d|--delimiter=delim]
       [-E eof-str] [-e[eof-str]] [--eof[=eof-str]]
       [-L max-lines] [-l[max-lines]] [--max-lines[=max-lines]]
       [-I replace-str] [-i[replace-str]] [--replace[=replace-str]]
       [-n max-args] [--max-args=max-args]
       [-s max-chars] [--max-chars=max-chars]
       [-P max-procs] [--max-procs=max-procs] [--show-limits]
       [--verbose] [--exit] [--no-run-if-empty] [--arg-file=file]
       [--version] [--help] [command [initial-arguments]]
I have a minimal case that seems to break the scorer (and I confess, I'm using the scorer for an entity clustering and linking evaluation which isn't TAC KBP).
Let's say your corpus contains two documents. Document 1 contains a gold mention, and document 2 contains no gold mentions; for the evaluated system, document 1 contains no mentions and document 2 contains one mention. This minimal case causes the scorer to fail as follows:
INFO Converting gold to evaluation format..
INFO Converting systems to evaluation format..
INFO Evaluating systems..
neleval/evaluate.py:173: StrictMetricWarning: Strict P/R defaulting to zero score for zero denominator
  StrictMetricWarning)
INFO Preparing summary report..
INFO Calculating confidence intervals..
neleval/evaluate.py:173: StrictMetricWarning: Strict P/R defaulting to zero score for zero denominator
  StrictMetricWarning)
INFO preparing strong_link_match report..
INFO preparing strong_nil_match report..
INFO preparing strong_all_match report..
INFO preparing strong_typed_link_match report..
INFO Preparing error report..
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Volumes/Blinken/Projects/NEET-CO/TAC_2014_scoring/neleval/neleval/__main__.py", line 60, in <module>
    main()
  File "/Volumes/Blinken/Projects/NEET-CO/TAC_2014_scoring/neleval/neleval/__main__.py", line 57, in main
    print(obj())
  File "neleval/analyze.py", line 75, in __call__
    counts = Counter(error.label for error in _data())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/collections.py", line 444, in __init__
    self.update(iterable, **kwds)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/collections.py", line 525, in update
    for elem in iterable:
  File "neleval/analyze.py", line 75, in <genexpr>
    counts = Counter(error.label for error in _data())
  File "neleval/analyze.py", line 86, in iter_errors
    assert g.id == s.id
AssertionError
I'd be happy to send you the minimal test I've set up, if you need it. I'd try to fix it myself, but I'm hoping that you'll be faster :-).
I'm working with the latest clone of the repository: MacOS 10.9.5, Python 2.7.5, numpy 1.9.0, scipy 0.14.0, joblib 0.8.3-r1, nose 1.3.4.
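For what it's worth, a reproduction of the described case might look like the following two files; the tab-separated docid/start/end/KB-id/score/type layout is my reading of the evaluation format, so adjust as needed:

```
# gold.tsv: document 1 has one mention, document 2 has none
doc1	0	5	E0001	1.0	PER

# system.tsv: document 1 has none, document 2 has one
doc2	0	5	E0002	1.0	PER
```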
Not sure if you care, but the run_tac14_report.sh script fails on MacOS X 10.9 because the -P (Perl-regex) option has been removed from BSD grep; it was still there in 10.8. The recommended solution is to install GNU grep (not a practical requirement) or to replace the grep -P calls with awk or perl:
http://stackoverflow.com/questions/16658333/grep-p-no-longer-works-how-can-i-rewrite-my-searches
The main goal is to prevent input that will break the evaluation measure implementations here. We could also provide warnings as a convenience, to help users ensure their annotations/output meet expectations.
Duplicate mentions cause problems: the tool should print an error and exit if two mention/query IDs have the same span.
Whether nested mentions are desirable depends on the task definition. For instance, they are allowable in TAC14 but not in CoNLL/AIDA. The tool could print a warning as a convenience. Optionally, we could provide a flag to tidy nesting, e.g. by removing inner mentions.
Again, the tool could print a warning as a convenience.
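A minimal sketch of both checks, assuming mentions carry (docid, start, end) spans; the function names are placeholders rather than existing neleval code:

```python
import warnings

def check_duplicates(mentions):
    """Exit with an error if two mention/query IDs share the same span."""
    seen = set()
    for m in mentions:
        span = (m.docid, m.start, m.end)
        if span in seen:
            raise SystemExit('Duplicate mention span: %r' % (span,))
        seen.add(span)

def warn_nested(mentions):
    """Warn about nested mentions (allowable in TAC14, not in CoNLL/AIDA)."""
    by_doc = {}
    for m in mentions:
        by_doc.setdefault(m.docid, []).append(m)
    for doc, ms in by_doc.items():
        ms.sort(key=lambda m: (m.start, -m.end))
        prev_end = -1
        for m in ms:
            if m.end <= prev_end:  # contained in an earlier, wider mention
                warnings.warn('Nested mention in %s: (%d, %d)'
                              % (doc, m.start, m.end))
            prev_end = max(prev_end, m.end)
```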
LEA appears to define its P (R) as a macro-averaged P (R) over pairs, weighted by entity size (asymmetrically, such that recall is weighted by entity prevalence in the gold standard), with the exception that singleton clusters are treated as a single pair. (Is that correct, @ns-moosavi?)
I'm not sure LEA is used in practice yet. In particular, I have my doubts about how principled the handling of singletons is. It would be more consistent to use link(n) = n^2 / 2 instead of n(n-1)/2, so that every mention is granted its singleton link. But then this would be identical to B-cubed, if I'm not mistaken.
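To make the definitions concrete (a tiny sketch; link_lea encodes the singleton exception described above):

```python
def link_strict(n):
    """Unordered mention pairs within an entity of size n: n(n-1)/2."""
    return n * (n - 1) // 2

def link_lea(n):
    """LEA's exception: a singleton entity still contributes one 'pair'."""
    return max(1, n * (n - 1) // 2)

def link_squared(n):
    """Proposed alternative n^2/2: every mention is granted its self-link."""
    return n * n / 2.0

for n in (1, 2, 3):
    print(n, link_strict(n), link_lea(n), link_squared(n))
# 1 0 1 0.5
# 2 1 1 2.0
# 3 3 3 4.5
```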
Usually people care about CEAFe, MUC, PW and B3 for general coreference problems.
Currently, if I do not specify anything,
../neleval/nel evaluate -g gold/gold.file -f tab pred.file
I get a lot of output, most of which I do not care about. On the other hand, I can specify
../neleval/nel evaluate -g gold/gold.file -f tab pred.file -m muc
which gives me only one measure. Maybe there should be a way to specify multiple metrics, e.g.
-m muc,b_cubed,pairwise
Or have a flag for printing the common metrics?
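One way this could look, as a sketch of an argparse option that accepts a comma-separated list (hypothetical, not the current CLI):

```python
import argparse

def measure_list(value):
    # split 'muc,b_cubed,pairwise' into ['muc', 'b_cubed', 'pairwise']
    return value.split(',')

parser = argparse.ArgumentParser()
parser.add_argument('-m', '--measures', type=measure_list, default=['all'],
                    help='comma-separated measures, e.g. muc,b_cubed,pairwise')

args = parser.parse_args(['-m', 'muc,b_cubed,pairwise'])
print(args.measures)  # ['muc', 'b_cubed', 'pairwise']
```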
./scripts/run_tac14_all.sh: line 68: ./scripts/run_report_confidence: No such file or directory
Should be run_report_confidence.sh.
I tried running the prepare-conll-coref command:
$ neleval prepare-conll-coref /path/to/AIDA-YAGO2-dataset.tsv
but no file is generated. How do I convert the CoNLL/AIDA dataset format to the neleval format?
Thanks in advance,
I've hacked this into Reader/Document at https://github.com/wikilinks/neleval/compare/merge-duplicates?expand=1, but given that there is a facility to read in multiple candidates from each line of a .tsv, this should really be happening in prepare-tac.
There should still be an error raised if the gold data for prepare-tac has multiple candidates for any annotation, so prepare-tac should probably have a --gold mode.
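A sketch of the merging behaviour, assuming annotations with a span and a list of KB candidates; the gold flag follows the --gold suggestion above, and the helper is illustrative rather than the prepare-tac implementation:

```python
from collections import OrderedDict

def merge_duplicates(annotations, gold=False):
    """Merge annotations sharing a span into one, pooling their candidates.

    In gold mode, duplicated spans are an error instead: gold data should
    have at most one candidate per annotation.
    """
    merged = OrderedDict()
    for a in annotations:
        span = (a.docid, a.start, a.end)
        if span in merged:
            if gold:
                raise ValueError('Gold annotation with multiple candidates: %r'
                                 % (span,))
            merged[span].candidates.extend(a.candidates)
        else:
            merged[span] = a
    return list(merged.values())
```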
Hi, I'm getting a segfault on evaluation:
$ ./nel evaluate -m all -f tab -g ../scoring/gold/e54_v11.kbp_train.combined.tab ../scoring/0.4/kbp_train.combined.tsv
neleval/evaluate.py:173: StrictMetricWarning: Strict P/R defaulting to zero score for zero denominator
StrictMetricWarning)
./nel: line 2: 11358 Segmentation fault (core dumped) python -m neleval.__main__ "$@"
I've tried this on two different machines, and on both my train and test splits of the LDC2014E54 data, without luck. Oddly, it will run on the first half and the last half of the file, and even on the first two-thirds and the last two-thirds, but not on the whole thing. Any ideas?