
neleval's People

Contributors

benhachey, jnothman, wejradford


neleval's Issues

Support evaluating with incomplete gold standards

A user on the TAC mailing list wished to evaluate a mix of 2014 and pre-2014 EL tasks, such that precision should indicate precision of linking/clustering, ignoring spurious mentions.

While it is not hard to remove spurious mentions using grep, this may be worth facilitating in one of the following ways:

  1. a command to output the subset of a dataset that aligns to the gold standard.
  2. an --ignore-spurious flag to evaluate, significance and confidence.
  3. an is_aligned attribute on Annotations that is True for all gold annotations and set with respect to some gold standard when loading annotations from a system output.

Option (3) would appear to be the most flexible and the most in line with the current design.
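As a stopgap along the lines of option (1), a minimal sketch of such a subsetting step (not part of neleval; the script name is hypothetical, and it assumes the tab-separated annotation format with the document ID, start and end offsets in the first three columns):

    import sys

    # Usage (hypothetical): python subset_to_gold.py gold.tab < system.tab > system.subset.tab
    # Keep only system mentions whose (doc ID, start, end) span also appears in the gold standard.
    with open(sys.argv[1]) as f:
        gold_spans = set(tuple(line.rstrip('\n').split('\t')[:3]) for line in f)

    for line in sys.stdin:
        if tuple(line.rstrip('\n').split('\t')[:3]) in gold_spans:
            sys.stdout.write(line)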

improve test coverage

Start by pulling the broken tests out of neleval/test.py into a place where they actually get run.

Within-document evaluation mode for cross-doc coreference evaluation

Should be able to evaluate (the micro-average over documents of) within-document coreference resolution performance. With the current implementation the following approaches exist:

  • append document ID to entity ID manually (or using prepare-conll-coref)
  • score each document individually by splitting the input, then aggregate

Note that the former approach breaks for the pairwise_negative aggregate, as true negatives from across the corpus will be counted.

My currently preferred solution is to add an option to evaluate specifying which fields to break the calculation down by: ordinarily 'doc', but perhaps 'type' would also be of interest. evaluate would then calculate all measures over each group, then add micro-averaged and macro-averaged results. This would also mean we could rename the aggregate sets-micro to sets.
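For reference, a minimal sketch of the first workaround (prefixing entity IDs with the document ID); the column layout assumed here (doc ID, start, end, entity/KB link, score, type) is an assumption about the tab-separated format, not something the code checks:

    import sys

    # Rewrite each entity/KB link as "<doc ID>/<link>" so that cross-document
    # clustering metrics effectively become within-document ones.
    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        fields[3] = fields[0] + '/' + fields[3]  # link assumed to be in column 4
        sys.stdout.write('\t'.join(fields) + '\n')

The same transformation would have to be applied to both the gold and the system files before running evaluate.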

Thanks for expressing the need for this, @shyamupa

The order of lines in evaluate output can change

I've had output where the first report line has strong_all_match, and others where it has strong_nil_match. The numbers are the same, but the reordered lines are a bit confusing; a stable order would be great.
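One simple way to get a stable order would be to sort the report rows by measure name before printing; a trivial sketch with a hypothetical rows list, not neleval's internals:

    # Hypothetical report rows: (measure name, results) pairs.
    rows = [('strong_nil_match', {}), ('strong_all_match', {})]
    # Sorting by measure name gives a deterministic output order.
    for name, results in sorted(rows, key=lambda row: row[0]):
        print(name)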

Entity CEAF true positives and false positives off by factor of 2.

I believe there is a 2 missing in the definition of the dice coefficient which causes the true positives and false positives to be off by a factor of 2.
Line 343 of https://github.com/wikilinks/neleval/blob/master/neleval/coref_metrics.py
Luckily this doesn't affect precision, recall or F-score, since everything ends up off by the same factor of two, so the issue is not urgent, but it should be an easy fix.

My understanding of CEAF is based on what I read in http://www.aclweb.org/anthology/H05-1004. On page 28 they define the similarity metrics. It's also possible I've misunderstood how this metric is calculated. If that's the case, kindly let me know.
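For reference, a sketch of the entity similarity as I read it in Luo (2005), where phi_4(K, R) = 2|K ∩ R| / (|K| + |R|); this reflects my reading of the paper, not the repository's actual code:

    def dice(gold_entity, sys_entity):
        """Dice coefficient between two entities (sets of mentions)."""
        gold, sys = set(gold_entity), set(sys_entity)
        if not gold and not sys:
            return 0.0
        # The factor of 2 in the numerator is what is reported as missing.
        return 2.0 * len(gold & sys) / (len(gold) + len(sys))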

Citation?

I want to cite the tool. Which paper should I cite?

The TAC KBP overviews or the ACL paper "Cheap and easy entity evaluation"?

I can't run the shell scripts

sudo sh ./scripts/run_tac15_evaluation.sh /home/cb/eval/mentions_gold_name.xml /home/cb/eval/tac_kbp_2015_tedl_evaluation_gold_standard_entity_mentions.tab /home/cb/output /home/cb/evalre

I don't know what all of the arguments should be. Where can I find a readable introduction to the arguments?
This is the message that appears when I try to run the script:
INFO Converting gold to evaluation format..
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/neleval/__main__.py", line 91, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/neleval/__main__.py", line 86, in main
    result = obj()
  File "/usr/local/lib/python2.7/dist-packages/neleval/tac.py", line 193, in __call__
    return u'\n'.join(unicode(a) for a in self.read_annotations(self.system))
  File "/usr/local/lib/python2.7/dist-packages/neleval/tac.py", line 193, in <genexpr>
    return u'\n'.join(unicode(a) for a in self.read_annotations(self.system))
  File "/usr/local/lib/python2.7/dist-packages/neleval/tac.py", line 212, in read_annotations
    key=key_fn), key_fn)
IndexError: list index out of range
INFO converting systems to evaluation format..
ls: cannot access /home/cb/eval/tac_kbp_2015_tedl_evaluation_gold_standard_entity_mentions.tab/*: Not a directory
xargs: invalid number for -P option
Usage: xargs [-0prtx] [--interactive] [--null] [-d|--delimiter=delim]
[-E eof-str] [-e[eof-str]] [--eof[=eof-str]]
[-L max-lines] [-l[max-lines]] [--max-lines[=max-lines]]
[-I replace-str] [-i[replace-str]] [--replace[=replace-str]]
[-n max-args] [--max-args=max-args]
[-s max-chars] [--max-chars=max-chars]
[-P max-procs] [--max-procs=max-procs] [--show-limits]
[--verbose] [--exit] [--no-run-if-empty] [--arg-file=file]
[--version] [--help] [command [initial-arguments]]

Error in nel with non-matching mentions

I have a minimal case that seems to break the scorer (and I confess, I'm using the scorer for an entity clustering and linking evaluation which isn't TAC KBP).

Let's say your corpus contains two documents. Document 1 contains a gold mention, and document 2 contains no gold mentions; for the evaluated system, document 1 contains no mentions and document 2 contains one mention. This minimal case causes the scorer to fail as follows:

INFO Converting gold to evaluation format..
INFO Converting systems to evaluation format..
INFO Evaluating systems..
neleval/evaluate.py:173: StrictMetricWarning: Strict P/R defaulting to zero score for zero denominator
StrictMetricWarning)
INFO Preparing summary report..
INFO Calculating confidence intervals..
neleval/evaluate.py:173: StrictMetricWarning: Strict P/R defaulting to zero score for zero denominator
StrictMetricWarning)
INFO preparing strong_link_match report..
INFO preparing strong_nil_match report..
INFO preparing strong_all_match report..
INFO preparing strong_typed_link_match report..
INFO Preparing error report..
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Volumes/Blinken/Projects/NEET-CO/TAC_2014_scoring/neleval/neleval/__main__.py", line 60, in <module>
    main()
  File "/Volumes/Blinken/Projects/NEET-CO/TAC_2014_scoring/neleval/neleval/__main__.py", line 57, in main
    print(obj())
  File "neleval/analyze.py", line 75, in __call__
    counts = Counter(error.label for error in _data())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/collections.py", line 444, in __init__
    self.update(iterable, **kwds)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/collections.py", line 525, in update
    for elem in iterable:
  File "neleval/analyze.py", line 75, in <genexpr>
    counts = Counter(error.label for error in _data())
  File "neleval/analyze.py", line 86, in iter_errors
    assert g.id == s.id
AssertionError

I'd be happy to send you the minimal test I've set up, if you need it. I'd try to fix it myself, but I'm hoping that you'll be faster :-).
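For concreteness, a hedged sketch of what such a pair of annotation files might look like, assuming the tab-separated format with columns doc ID, start offset, end offset, KB/NIL link, score and type (the IDs and offsets below are made up):

    gold.tab (one mention in document 1, none in document 2):
        doc1	0	5	E0001	1.0	PER
    system.tab (no mention in document 1, one mention in document 2):
        doc2	10	15	E0002	1.0	PER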

I'm working with the latest clone of the repository. MacOS 10.9.5, Python 2.7.5, numpy 1.9.0, scipy 0.14.0, joblib 0.8.3-r1, nose 1.3.4.

Add data validation

The main goal is to prevent input that will break the evaluation measure implementations here. We could also provide warnings as a convenience to help users ensure annotation/output meets their expectations.

Duplicate mentions

Duplicate mentions cause problems. The tool should print an error and exit if two mentions/queries have the same span.

Nested mentions

Whether nested mentions are desirable depends on the task definition. For instance, they are allowable in TAC14 but not in CoNLL/AIDA. The tool could print a warning as a convenience. Optionally, we could provide a flag to tidy nesting, e.g., by removing inner mentions.

Crossing mentions

Again, the tool could print a warning as a convenience.
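A minimal sketch of what such a validation pass could look like (illustrative only: the doc_id, start and end attribute names are assumptions rather than neleval's actual API, and end offsets are treated as inclusive):

    import warnings
    from collections import defaultdict

    def validate_spans(annotations):
        """Flag duplicate, nested and crossing mention spans per document."""
        by_doc = defaultdict(list)
        for ann in annotations:
            by_doc[ann.doc_id].append((ann.start, ann.end))
        for doc_id, spans in by_doc.items():
            spans.sort()
            for i, (s1, e1) in enumerate(spans):
                for s2, e2 in spans[i + 1:]:
                    if s2 > e1:
                        break  # sorted by start: no later span overlaps (s1, e1)
                    if (s1, e1) == (s2, e2):
                        raise ValueError('duplicate mention %s %d-%d' % (doc_id, s1, e1))
                    elif s1 == s2 or e2 <= e1:
                        warnings.warn('nested mentions in %s: %d-%d and %d-%d'
                                      % (doc_id, s1, e1, s2, e2))
                    else:
                        warnings.warn('crossing mentions in %s: %d-%d and %d-%d'
                                      % (doc_id, s1, e1, s2, e2))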

Consider adding LEA to coref evaluation metrics?

LEA appears to define its P (R) as macro-averaged P (R) over pairs, weighted by entity size (asymmetrically, such that the recall is weighted by entity prevalence in the gold standard), with the exception that singleton clusters are treated as a single pair. (Is that correct, @ns-moosavi?)

I'm not sure whether LEA is used in practice yet. In particular, I have my doubts about how principled the handling of singletons is. It would be more consistent to use link(n) = n^2 / 2 instead of n(n-1)/2, so that every mention is granted its singleton link. But this would be identical to B-cubed if I'm not mistaken.
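For reference, my reading of the LEA recall from Moosavi and Strube (2016), with precision defined symmetrically over the response entities R (an interpretation rather than a definitive statement of the metric):

    R_{\mathrm{LEA}} = \frac{\sum_{k \in K} |k| \cdot \frac{\sum_{r \in R} \mathrm{link}(k \cap r)}{\mathrm{link}(k)}}{\sum_{k \in K} |k|},
    \qquad \mathrm{link}(e) = \frac{|e|\,(|e| - 1)}{2},

with singleton entities treated as contributing a single self-link rather than zero links, which is the special case questioned above.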

Feature Request: Specifying multiple metrics for file

Usually people care about CEAFe, MUC, PW and B3 for general coreference problems.
Currently, if I do not specify anything,

../neleval/nel evaluate -g gold/gold.file -f tab pred.file

I get a lot of output, which I really do not care about. On the other hand, I can specify,

../neleval/nel evaluate -g gold/gold.file -f tab pred.file -m muc

which gives me only one. Maybe there should be a way to specify multiple metrics, for example:

-m muc,b_cubed,pairwise

Or have a flag for printing the common metrics?
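One way this could be supported (an illustrative argparse sketch, not neleval's actual command-line code) is to make -m both repeatable and comma-separable:

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('-m', '--measure', dest='measures', action='append',
                        type=lambda s: s.split(','), default=[],
                        help='may be given several times or comma-separated')
    args = parser.parse_args(['-m', 'muc,b_cubed', '-m', 'pairwise'])
    measures = [m for group in args.measures for m in group]
    print(measures)  # ['muc', 'b_cubed', 'pairwise']

A separate alias such as -m common could then cover the usual CEAFe/MUC/B-cubed/pairwise set.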

Typo in run_tac14_all.sh

./scripts/run_tac14_all.sh: line 68: ./scripts/run_report_confidence: No such file or directory

Should be run_report_confidence.sh.

prepare-conll-coref does not convert AIDA-YAGO2-dataset

I tried running prepare-conll-coref on the AIDA-YAGO2 dataset:

$ neleval prepare-conll-coref /path/to/AIDA-YAGO2-dataset.tsv

No file is generated. How do I convert the CoNLL/AIDA dataset format to the neleval format?

Thanks in advance,

segfault?

Hi, I'm getting a segfault on evaluation:

$ ./nel evaluate -m all -f tab -g ../scoring/gold/e54_v11.kbp_train.combined.tab ../scoring/0.4/kbp_train.combined.tsv 
neleval/evaluate.py:173: StrictMetricWarning: Strict P/R defaulting to zero score for zero denominator
  StrictMetricWarning)
./nel: line 2: 11358 Segmentation fault      (core dumped) python -m neleval.__main__ "$@"

I've tried this on two different machines, and on both my train and test splits of the LDC2014E54 data, without luck. Oddly, it will run on the first half and the last half of a file, even on the first two-thirds and the last two-thirds, but not on the whole thing. Any ideas?
