wikilinks / neleval
Entity disambiguation evaluation and error analysis tool
License: Apache License 2.0
A user on the TAC mailing list wished to evaluate a mix of 2014 and pre-2014 EL tasks, such that precision should indicate precision of linking/clustering, ignoring spurious mentions. While it is not hard to remove spurious mentions using grep, this may be worth facilitating in one of the following ways:
- an --ignore-spurious flag to evaluate, significance and confidence
- an .is_aligned attribute on Annotations that is True for all gold annotations and, when loading annotations from a system output, is set with respect to some gold standard
The .is_aligned approach would appear to be most flexible and in line with the current design.
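A minimal sketch of the .is_aligned idea, assuming annotations are keyed by (docid, start, end) spans; the Annotation fields and the mark_alignment helper here are hypothetical placeholders, not neleval's actual API:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """Stand-in for neleval's Annotation; fields are illustrative."""
    docid: str
    start: int
    end: int
    kbid: str
    is_aligned: bool = True  # True for all gold annotations

def mark_alignment(system, gold):
    """Set is_aligned on system annotations with respect to a gold standard.

    Spurious mentions (spans absent from gold) get is_aligned=False and
    can be dropped before computing linking/clustering precision.
    """
    gold_spans = {(a.docid, a.start, a.end) for a in gold}
    for a in system:
        a.is_aligned = (a.docid, a.start, a.end) in gold_spans
    return [a for a in system if a.is_aligned]
```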
Start by pulling the broken tests out of neleval/test.py into somewhere they actually get run. In an ideal world, we would have designed it this way from the start...
Should be able to evaluate (the micro-average over documents of) within-document coreference resolution performance. With the current implementation the following approaches exist:
- treat the entire corpus as a single clustering problem in evaluate
- use the reference CoNLL scorer (via prepare-conll-coref)
Note that the former approach breaks for the pairwise_negative aggregate, as true negatives from across the corpus will be counted.
My current preferred solution is to add an option to evaluate specifying which fields to break the calculation down by: ordinarily 'doc', but perhaps 'type' would also be of interest. evaluate would then calculate all measures over each group, then add results for the micro-average and macro-average. This would also mean we can rename the sets-micro aggregate to sets.
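A rough sketch of that breakdown, assuming each measure can report (tp, fp, fn) counts per group; score_group and the field accessor are hypothetical placeholders, not the current evaluate internals:

```python
from collections import defaultdict

def grouped_scores(gold, system, score_group, key=lambda a: a.docid):
    """Score each group (ordinarily a document) separately, then average."""
    gold_by, sys_by = defaultdict(list), defaultdict(list)
    for a in gold:
        gold_by[key(a)].append(a)
    for a in system:
        sys_by[key(a)].append(a)

    def prf(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # per-group (tp, fp, fn) counts from whichever measure is being computed
    counts = [score_group(gold_by[k], sys_by[k])
              for k in sorted(set(gold_by) | set(sys_by))]
    micro = prf(*map(sum, zip(*counts)))  # pool counts, then compute P/R/F
    per_group = [prf(*c) for c in counts]
    macro = [sum(col) / len(col) for col in zip(*per_group)]  # mean P/R/F
    return micro, macro
```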
Thanks for expressing the need for this, @shyamupa
I've had output where the first report line is strong_all_match, and other output where it is strong_nil_match. The numbers are the same, but the reordered lines are a bit confusing; a stable order would be great.
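Presumably this just needs the report rows sorted by a canonical list of measure names before printing; a minimal sketch, with the row layout assumed to be (measure, scores...):

```python
MEASURE_ORDER = ['strong_link_match', 'strong_nil_match',
                 'strong_all_match', 'strong_typed_link_match']
RANK = {m: i for i, m in enumerate(MEASURE_ORDER)}

def stable_rows(rows):
    """Sort (measure, scores) rows canonically; unknown measures go last."""
    return sorted(rows, key=lambda row: (RANK.get(row[0], len(RANK)), row[0]))
```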
I believe there is a 2 missing in the definition of the Dice coefficient, which causes the true positives and false positives to be off by a factor of 2. See line 343 of https://github.com/wikilinks/neleval/blob/master/neleval/coref_metrics.py.
Luckily this doesn't affect precision, recall or F-score, since everything ends up off by the same factor of two, so the issue is not urgent, but it should be an easy fix.
My understanding of CEAF is based on what I read in http://www.aclweb.org/anthology/H05-1004; the similarity metrics are defined on page 28. It's also possible I've misunderstood how this metric is calculated; if that's the case, kindly let me know.
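For reference, the Dice coefficient between two entity sets A and B (the φ4 similarity in Luo 2005) is 2|A ∩ B| / (|A| + |B|); a minimal standalone version, not the code at the line cited above:

```python
def dice(a, b):
    """Dice coefficient between two sets of mentions.

    The factor of 2 in the numerator is what makes dice(x, x) == 1.0.
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))
```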
Allows us to do weighted coref scores.
Should these be nel?
Hi there, this is my mistake. Sorry about this issue.
Thank you for your work!
Arda
I want to cite the tool. Which paper should I cite? The TAC KBP overviews, or the ACL paper "Cheap and easy entity evaluation"?
We could save on so many scripts and their debugging with a bit of pandas.
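For instance, filtering and aggregating an evaluation report could become a few lines; this sketch assumes the report is tab-separated with columns including measure, precision, recall and fscore (the column names are my assumption):

```python
import pandas as pd

# Load a tab-separated evaluation report (column names assumed).
report = pd.read_csv('evaluation.tsv', sep='\t')

# Keep only the strong mention-level measures, in a stable order.
strong = report[report['measure'].str.startswith('strong_')]
print(strong.sort_values('measure')[['measure', 'precision', 'recall', 'fscore']])
```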
We are a 2016 KBP team and have met a problem: the results obtained with this procedure are not the same as the official TAC results.
sudo sh ./scripts/run_tac15_evaluation.sh /home/cb/eval/mentions_gold_name.xml /home/cb/eval/tac_kbp_2015_tedl_evaluation_gold_standard_entity_mentions.tab /home/cb/output /home/cb/evalre
I don't know what the full set of arguments is. Where can I find a readable introduction to the arguments? This is the message that appears when I try to run the scripts:
INFO Converting gold to evaluation format..
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/neleval/__main__.py", line 91, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/neleval/__main__.py", line 86, in main
    result = obj()
  File "/usr/local/lib/python2.7/dist-packages/neleval/tac.py", line 193, in __call__
    return u'\n'.join(unicode(a) for a in self.read_annotations(self.system))
  File "/usr/local/lib/python2.7/dist-packages/neleval/tac.py", line 193, in <genexpr>
    return u'\n'.join(unicode(a) for a in self.read_annotations(self.system))
  File "/usr/local/lib/python2.7/dist-packages/neleval/tac.py", line 212, in read_annotations
    key=key_fn), key_fn)
IndexError: list index out of range
INFO converting systems to evaluation format..
ls: cannot access /home/cb/eval/tac_kbp_2015_tedl_evaluation_gold_standard_entity_mentions.tab/*: Not a directory
xargs: invalid number for -P option
Usage: xargs [-0prtx] [--interactive] [--null] [-d|--delimiter=delim]
       [-E eof-str] [-e[eof-str]] [--eof[=eof-str]]
       [-L max-lines] [-l[max-lines]] [--max-lines[=max-lines]]
       [-I replace-str] [-i[replace-str]] [--replace[=replace-str]]
       [-n max-args] [--max-args=max-args]
       [-s max-chars] [--max-chars=max-chars]
       [-P max-procs] [--max-procs=max-procs] [--show-limits]
       [--verbose] [--exit] [--no-run-if-empty] [--arg-file=file]
       [--version] [--help] [command [initial-arguments]]
I have a minimal case that seems to break the scorer (and I confess, I'm using the scorer for an entity clustering and linking evaluation which isn't TAC KBP).
Let's say your corpus contains two documents. Document 1 contains a gold mention, and document 2 contains no gold mentions; for the evaluated system, document 1 contains no mentions and document 2 contains one mention. This minimal case causes the scorer to fail as follows:
INFO Converting gold to evaluation format..
INFO Converting systems to evaluation format..
INFO Evaluating systems..
neleval/evaluate.py:173: StrictMetricWarning: Strict P/R defaulting to zero score for zero denominator
  StrictMetricWarning)
INFO Preparing summary report..
INFO Calculating confidence intervals..
neleval/evaluate.py:173: StrictMetricWarning: Strict P/R defaulting to zero score for zero denominator
  StrictMetricWarning)
INFO preparing strong_link_match report..
INFO preparing strong_nil_match report..
INFO preparing strong_all_match report..
INFO preparing strong_typed_link_match report..
INFO Preparing error report..
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Volumes/Blinken/Projects/NEET-CO/TAC_2014_scoring/neleval/neleval/__main__.py", line 60, in <module>
    main()
  File "/Volumes/Blinken/Projects/NEET-CO/TAC_2014_scoring/neleval/neleval/__main__.py", line 57, in main
    print(obj())
  File "neleval/analyze.py", line 75, in __call__
    counts = Counter(error.label for error in _data())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/collections.py", line 444, in __init__
    self.update(iterable, **kwds)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/collections.py", line 525, in update
    for elem in iterable:
  File "neleval/analyze.py", line 75, in <genexpr>
    counts = Counter(error.label for error in _data())
  File "neleval/analyze.py", line 86, in iter_errors
    assert g.id == s.id
AssertionError
I'd be happy to send you the minimal test I've set up, if you need it. I'd try to fix it myself, but I'm hoping that you'll be faster :-).
I'm working with the latest clone of the repository: MacOS 10.9.5, Python 2.7.5, numpy 1.9.0, scipy 0.14.0, joblib 0.8.3-r1, nose 1.3.4.
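For what it's worth, a reproduction of the described case might look like the following two files; the tab-separated docid/start/end/KB-id/score/type layout is my reading of the evaluation format, so adjust as needed:

```
# gold.tsv: document 1 has one mention, document 2 has none
doc1	0	5	E0001	1.0	PER

# system.tsv: document 1 has none, document 2 has one
doc2	0	5	E0002	1.0	PER
```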
Not sure if you care, but the run_tac14_report.sh script fails on MacOS X 10.9 because the -P (Perl-regex) option has been removed from BSD grep; it was still there in 10.8. The recommended solution is to install GNU grep (not a practical requirement) or to replace the grep -P calls with awk or perl:
http://stackoverflow.com/questions/16658333/grep-p-no-longer-works-how-can-i-rewrite-my-searches
The main goal is to prevent input that will break the evaluation measure implementations here. We could also provide warnings as a convenience, to help users ensure their annotations/output meet expectations.
Duplicate mentions cause problems: the tool should print an error and exit if two mention/query IDs have the same span.
Whether nested mentions are desirable depends on the task definition. For instance, they are allowable in TAC14 but not in CoNLL/AIDA. The tool could print a warning as a convenience. Optionally, we could provide a flag to tidy nesting, e.g. by removing inner mentions.
Again, the tool could print a warning as a convenience.
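A minimal sketch of both checks, assuming mentions carry (docid, start, end) spans; the function names are placeholders rather than existing neleval code:

```python
import warnings

def check_duplicates(mentions):
    """Exit with an error if two mention/query IDs share the same span."""
    seen = set()
    for m in mentions:
        span = (m.docid, m.start, m.end)
        if span in seen:
            raise SystemExit('Duplicate mention span: %r' % (span,))
        seen.add(span)

def warn_nested(mentions):
    """Warn about nested mentions (allowable in TAC14, not in CoNLL/AIDA)."""
    by_doc = {}
    for m in mentions:
        by_doc.setdefault(m.docid, []).append(m)
    for doc, ms in by_doc.items():
        ms.sort(key=lambda m: (m.start, -m.end))
        prev_end = -1
        for m in ms:
            if m.end <= prev_end:  # contained in an earlier, wider mention
                warnings.warn('Nested mention in %s: (%d, %d)'
                              % (doc, m.start, m.end))
            prev_end = max(prev_end, m.end)
```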
LEA appears to define its P (R) as a macro-averaged P (R) over pairs, weighted by entity size (asymmetrically, such that recall is weighted by entity prevalence in the gold standard), with the exception that singleton clusters are treated as a single pair. (Is that correct, @ns-moosavi?)
I'm not sure LEA is used in practice yet. In particular, I have my doubts about how principled the handling of singletons is. It would be more consistent to use link(n) = n^2 / 2 instead of n(n-1)/2, so that every mention is granted its singleton link. But then this would be identical to B-cubed, if I'm not mistaken.
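To make the definitions concrete (a tiny sketch; link_lea encodes the singleton exception described above):

```python
def link_strict(n):
    """Unordered mention pairs within an entity of size n: n(n-1)/2."""
    return n * (n - 1) // 2

def link_lea(n):
    """LEA's exception: a singleton entity still contributes one 'pair'."""
    return max(1, n * (n - 1) // 2)

def link_squared(n):
    """Proposed alternative n^2/2: every mention is granted its self-link."""
    return n * n / 2.0

for n in (1, 2, 3):
    print(n, link_strict(n), link_lea(n), link_squared(n))
# 1 0 1 0.5
# 2 1 1 2.0
# 3 3 3 4.5
```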
Usually people care about CEAFe, MUC, PW and B3 for general coreference problems.
Currently, if I do not specify anything,
../neleval/nel evaluate -g gold/gold.file -f tab pred.file
I get a lot of output, most of which I do not care about. On the other hand, I can specify
../neleval/nel evaluate -g gold/gold.file -f tab pred.file -m muc
which gives me only one measure. Maybe there should be a way to specify multiple metrics, e.g.
-m muc,b_cubed,pairwise
Or have a flag for printing the common metrics?
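One way this could look, as a sketch of an argparse option that accepts a comma-separated list (hypothetical, not the current CLI):

```python
import argparse

def measure_list(value):
    # split 'muc,b_cubed,pairwise' into ['muc', 'b_cubed', 'pairwise']
    return value.split(',')

parser = argparse.ArgumentParser()
parser.add_argument('-m', '--measures', type=measure_list, default=['all'],
                    help='comma-separated measures, e.g. muc,b_cubed,pairwise')

args = parser.parse_args(['-m', 'muc,b_cubed,pairwise'])
print(args.measures)  # ['muc', 'b_cubed', 'pairwise']
```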
./scripts/run_tac14_all.sh: line 68: ./scripts/run_report_confidence: No such file or directory
Should be run_report_confidence.sh.
I tried running the prepare-conll-coref command:
$ neleval prepare-conll-coref /path/to/AIDA-YAGO2-dataset.tsv
but no file is generated. How do I convert the CoNLL/AIDA dataset format to the neleval format?
Thanks in advance,
I've hacked this into Reader/Document at https://github.com/wikilinks/neleval/compare/merge-duplicates?expand=1, but given that there is a facility to read in multiple candidates from each line of a .tsv, this should really be happening in prepare-tac.
There should still be an error raised if the gold data for prepare-tac has multiple candidates for any annotation, so prepare-tac should probably have a --gold mode.
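A sketch of the merging behaviour, assuming annotations with a span and a list of KB candidates; the gold flag follows the --gold suggestion above, and the helper is illustrative rather than the prepare-tac implementation:

```python
from collections import OrderedDict

def merge_duplicates(annotations, gold=False):
    """Merge annotations sharing a span into one, pooling their candidates.

    In gold mode, duplicated spans are an error instead: gold data should
    have at most one candidate per annotation.
    """
    merged = OrderedDict()
    for a in annotations:
        span = (a.docid, a.start, a.end)
        if span in merged:
            if gold:
                raise ValueError('Gold annotation with multiple candidates: %r'
                                 % (span,))
            merged[span].candidates.extend(a.candidates)
        else:
            merged[span] = a
    return list(merged.values())
```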
Hi, I'm getting a segfault on evaluation:
$ ./nel evaluate -m all -f tab -g ../scoring/gold/e54_v11.kbp_train.combined.tab ../scoring/0.4/kbp_train.combined.tsv
neleval/evaluate.py:173: StrictMetricWarning: Strict P/R defaulting to zero score for zero denominator
StrictMetricWarning)
./nel: line 2: 11358 Segmentation fault (core dumped) python -m neleval.__main__ "$@"
I've tried this on two different machines, and on both my train and test splits of the LDC2014E54 data, without luck. Oddly, it will run on the first half and the last half of the file, and even on the first two-thirds and the last two-thirds, but not on the whole thing. Any ideas?