

trectools's Issues

Mean Average Precision and Pooled Evaluations

I came across an issue with the calculation of the mean average precision of pooled evaluations which results in mAP@k being lower than expected if k is less than the number of relevant documents in the pool.

Let r be the number of relevant documents in the pool, P(q,k) the precision of query q at cut-off k, d(q,i) the document at position i in query q and rel(q,d) the relevance of document d for query q with rel: Q × D -> {0,1}.

TrecTools calculates the average precision@k of a query q as

sum(P(q,i) ⋅ rel(q,d(q,i)) for i in {1,...,k}) / r

instead of

sum(P(q,i) ⋅ rel(q,d(q,i)) for i in {1,...,k}) / min(r,k)

resulting in counterintuitive results for AP (and, thus, mAP) if r < k.

I suggest adding a flag to get_map to set the denominator to min(r,k) instead of r for pooled query evaluations.

I've attached a minimal example with a query of ten documents, with the first five of them being relevant, and a total of ten relevant documents in the evaluation pool, for which TrecTools calculates the mAP@5 as 0.5 instead of 1.0.

qrels.txt
run.txt
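
A minimal sketch of how the attached files could be used to reproduce this, assuming TrecEval.get_map accepts a depth argument; the numbers follow the example above:

from trectools import TrecQrel, TrecRun, TrecEval

qrels = TrecQrel("qrels.txt")   # pool with r = 10 relevant documents
run = TrecRun("run.txt")        # 10 retrieved documents, the first 5 relevant

evaluator = TrecEval(run, qrels)

# The numerator P(q,1) + ... + P(q,5) equals 5, so dividing by r = 10
# gives 0.5, whereas dividing by min(r, k) = 5 would give the expected 1.0.
print(evaluator.get_map(depth=5))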

tag releases in git

It would be nice for downstream distros if releases were tagged in git/GitHub.

Numerical difference between trectools and trec_eval in terms of Rprec and ndcg

Hello,

I have a situation where there are large differences between some metrics computed using the official trec_eval (as called from pytrec_eval) and trectools. To reproduce, you can run the following (it will download file.qrel and file.run from a gist):

Expected output:

trectools_results {'Rprec': 79.06, 'ndcg': 93.69}                                                                                                                       
pytrec_eval_results {'Rprec': 79.62, 'ndcg': 94.35}

The code:

import urllib.request
import numpy as np
from trectools import TrecQrel, TrecRun, TrecEval
import pytrec_eval

##### download qrel and run files

qrel_file_path = 'https://gist.githubusercontent.com/sergeyf/4d88da8d865ccad06cfd140b8583cf55/raw/d93a104817103e4330619aa93506c424cfd5ae16/file.qrel'
qrel_file = 'file.qrel'
urllib.request.urlretrieve(qrel_file_path, qrel_file)

run_file_path = 'https://gist.githubusercontent.com/sergeyf/4d88da8d865ccad06cfd140b8583cf55/raw/d93a104817103e4330619aa93506c424cfd5ae16/file.run'
run_file = 'file.run'
urllib.request.urlretrieve(run_file_path, run_file)


###### trectools

qrel = TrecQrel(qrel_file)
run = TrecRun(run_file)

trec_eval = TrecEval(run, qrel)

trectools_results = {'Rprec': np.round(100 * trec_eval.get_rprec(), 2),
                     'ndcg': np.round(100 * trec_eval.get_ndcg(), 2)}



###### pytrec_eval
def get_metrics(qrel_file, run_file, metrics=('ndcg', 'Rprec')):
    with open(qrel_file, 'r') as f_qrel:
        qrel = pytrec_eval.parse_qrel(f_qrel)

    with open(run_file, 'r') as f_run:
        run = pytrec_eval.parse_run(f_run)
        
    evaluator = pytrec_eval.RelevanceEvaluator(qrel, set(metrics))
    results = evaluator.evaluate(run)

    out = {}
    for measure in sorted(metrics):
        res = pytrec_eval.compute_aggregated_measure(
                measure, 
                [query_measures[measure]  for query_measures in results.values()]
            )
        out[measure] = np.round(100 * res, 2)
    return out


pytrec_eval_results = get_metrics(qrel_file, run_file)

print('trectools_results', trectools_results)
print('pytrec_eval_results', pytrec_eval_results)

pip install fails "No such file or directory: 'README.md'"

$ pip install trectools
Collecting trectools
  Using cached https://files.pythonhosted.org/packages/34/f2/a2b7238f45ebf9ce616c2ff50c822372f47962b0c98511c2cd251d1948b2/trectools-0.0.38.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-wh2iivbh/trectools/setup.py", line 30, in <module>
        long_description=open('README.md').read()
    FileNotFoundError: [Errno 2] No such file or directory: 'README.md'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-wh2iivbh/trectools/

I can install with pip install git+https://github.com/joaopalotti/trec_tools.git or pip install trectools==0.0.37

Run example 0 with NDCG@5

Is it possible to run this example with NDCG@5 instead of P@10? Simply replacing "P_10" with "NDCG_5" does not seem to work.

from trectools import TrecQrel, procedures

qrels_file = "./robust03/qrel/robust03_qrels.txt"
qrels = TrecQrel(qrels_file)

# Generates a NDCG@5 graph with all the runs in a directory
path_to_runs = "./robust03/runs/"
runs = procedures.list_of_runs_from_path(path_to_runs, "*.gz")

results = procedures.evaluate_runs(runs, qrels, per_query=True)
ndcg_5 = procedures.extract_metric_from_results(results, "NDCG_5")
fig = procedures.plot_system_rank(ndcg_5, display_metric="NDCG@5", outfile="plot.pdf")
fig.savefig("plot.pdf", bbox_inches='tight', dpi=600)
# Sample output with one run for each participating team in robust03:
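
A hedged workaround sketch that skips procedures and computes NDCG@5 per run with TrecEval directly, assuming get_ndcg accepts a depth parameter:

from trectools import TrecQrel, TrecEval, procedures

qrels = TrecQrel("./robust03/qrel/robust03_qrels.txt")
runs = procedures.list_of_runs_from_path("./robust03/runs/", "*.gz")

for i, run in enumerate(runs):
    # One NDCG@5 value per run; aggregation and plotting are left out here.
    print(i, TrecEval(run, qrels).get_ndcg(depth=5))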

Bump supported Terrier version

Hi,

Your commands match Terrier v4 (which Terrier 5 and 5.1 also support). I'm happy to provide input on changing the command-line incantations for Terrier v5.

Craig

recall support

Hi,
Is recall at different depths supported by the tool?

Average over the queries in the intersection of relevance judgements and results

Hello, I've recently been using trectools and found that its evaluation does not match trec_eval. In trec_eval, the scores are always averaged over the queries in the intersection of the relevance judgements and the results. In trectools, however, the average is taken over all queries in the results, some of which may not be judged; those should be ignored rather than contribute a value of 0.
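
A hedged workaround sketch, restricting the run to judged queries before evaluating so that unjudged queries do not contribute a score of 0 (file names are placeholders):

from trectools import TrecQrel, TrecRun, TrecEval

qrels = TrecQrel("qrels.txt")
run = TrecRun("run.txt")

# Keep only queries that appear in the relevance judgements.
judged = set(qrels.qrels_data["query"].unique())
run.run_data = run.run_data[run.run_data["query"].isin(judged)]

print(TrecEval(run, qrels).get_map())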

MAP function does not consider properly ranking order

I think the method get_map() in trec_eval.py does not take ranking order into account properly.
In line 337 the run_data dataframe is ordered by query, score and docid (rank is not taken into account):
trecformat = self.run.run_data.sort_values(["query", "score", "docid"], ascending=[True,False,False]).reset_index()

Then, the ranking column is artificially created in lines 346-347:

topX["rank"] = 1
topX["rank"] = topX.groupby("query")["rank"].cumsum()
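
For illustration, a sketch of the ordering this report implies, keeping the submitted rank as the tie-breaker instead of docid (column names as in TrecRun.run_data; the file name is a placeholder):

from trectools import TrecRun

run = TrecRun("run.txt")

# Sort by query and score, breaking score ties with the original rank so
# the submitted ordering is preserved.
ordered = run.run_data.sort_values(
    ["query", "score", "rank"], ascending=[True, False, True]
).reset_index(drop=True)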

iprec_at_recall_LEVEL is missing

Among all the trec_eval default evaluation metrics, iprec_at_recall_LEVEL is the only one missing and needs to be implemented.

Change runtag of a run

It'd be nice to be able to change the runtag of a run. For example, I load two runs, fuse them together, and then write out the run file. I'd like to set my own runtag.
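
A workaround sketch, assuming the runtag is the "system" column of TrecRun.run_data and that print_subset writes the run back to disk (file names and tag are placeholders):

from trectools import TrecRun, fusion

r1 = TrecRun("runA.txt")
r2 = TrecRun("runB.txt")

# Fuse the two runs, overwrite the runtag, and write the result out.
fused = fusion.reciprocal_rank_fusion([r1, r2])
fused.run_data["system"] = "my_fused_runtag"
fused.print_subset("fused_run.txt", topX=1000)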

Add MSMARCO-based topic files

I think this should be a fairly straightforward addition: each topic in MSMARCO is just a topic id and a query separated by a tab. Example:

1108939	what slows down the flow of blood
1112389	what is the county for grand rapids, mn
792752	what is ruclip
1119729	what do you do when you have a nosebleed from having your nose
1105095	where is sugar lake lodge located
1105103	where is steph currys home in nc
1128373	iur definition
1127622	meaning of heat capacity
1124979	synonym for treatment
885490	what party is paul ryan in
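
In the meantime, a sketch of reading such a file with pandas (the file name is a placeholder):

import pandas as pd

# MSMARCO topic files are tab-separated: topic id, then query text.
topics_df = pd.read_csv(
    "msmarco-queries.tsv", sep="\t", names=["number", "query"], dtype=str
)
topics = dict(zip(topics_df["number"], topics_df["query"]))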

Fusion using combos

Hi,
I was trying to get fused runs. I managed to do it fine with reciprocal_rank_fusion following the example you showed, but with the combos function the first issue I noticed is that it does not return a TrecRun object as reciprocal_rank_fusion does, so I had to convert it to a TrecRun object myself:

fused_run = TrecRun(fused_run)

but then I get the error below:

   f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Kindly advise.

Extend TrecTopics - query inside topic tag

In this year's edition of [TREC-CT](http://www.trec-cds.org/2021.html), the topics are formatted as follows:

<topics task="2021 TREC Clinical Trials">
  <topic number="-1">
    A 2-year-old boy is brought to the emergency department by his parents for 5 days of high fever
    and irritability. The physical exam reveals conjunctivitis, strawberry tongue, inflammation of
    the hands and feet, desquamation of the skin of the fingers and toes, and cervical
    lymphadenopathy with the smallest node at 1.5 cm. The abdominal exam demonstrates tenderness
    and enlarged liver. Laboratory tests report elevated alanine aminotransferase, white blood cell
    count of 17,580/mm, albumin 2.1 g/dL, C-reactive protein 4.5 mg, erythrocyte sedimentation rate
    60 mm/h, mild normochromic, normocytic anemia, and leukocytes in urine of 20/mL with no bacteria
    identified. The echocardiogram shows moderate dilation of the coronary arteries with possible
    coronary artery aneurysm.
  </topic>
</topics>

Thus it is not compatible with the current implementation of trec_topics.py, since it looks for a query_tag within the topic tag, which in this case does not exist. Here I suggest that querytext_tag (currently "query") is initialized as None, and, if it is None, the line query = topic.findNext(querytext_tag).getText() becomes query = topic.getText().

I can handle and send a PR if my suggestion seems like a good solution.
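
A sketch of the suggested behaviour (the function name and signature are illustrative, not the library's API):

def extract_query(topic, querytext_tag=None):
    # If no query tag is given, take the text of the <topic> element itself;
    # otherwise keep the current behaviour of looking up the inner tag.
    if querytext_tag is None:
        return topic.getText().strip()
    return topic.findNext(querytext_tag).getText().strip()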

Score ties in run files lead to non-stable pools & evaluation

Problem Description

I recently used trectools to evaluate run files which had score ties in them, i.e. multiple documents with the same score given by the retrieval system.

Right now, when reading a run file, the documents are sorted by score. This leads to a problem, as the default pandas sorting algorithm is quick sort, which is not stable. Therefore, the case can arise that the order of document IDs, and thus the top-X documents used in multiple places throughout trectools, is not guaranteed to be the same every time. In my case, this created the issue of a top-5 evaluation having incomplete coverage on a qrel file created from the top-5 pool of the same run files.

Proposed Solutions

As pandas only allows unstable sorting when sorting by multiple columns, the only solution I see is to add rank as a third sort key to keep the original order in case of score ties:

self.run_data.sort_values(["query","score","rank"], inplace=True, ascending=[True,False,True])

Enforce explicit dtypes throughout all DataFrame-based classes

Right now, both TrecRun and TrecQrel have whatever datatypes pandas automatically infers for each column.

self.run_data = pd.read_csv(filename, sep="\s+", names=["query", "q0", "docid", "rank", "score", "system"])

This leads to inconsistencies across different experiments based on the doc_id and query names used: some get inferred as int64 (for example MSMARCO-Passage), some as str (for example ClueWeb12). While this is not a problem when working in the "intended" file-based way of loading and evaluating runs/qrels, it incurs some frustration when accessing the TrecRun.run_data and TrecQrel.qrels_data fields directly in in-memory experiments (related issue: #13).

Here, the experiment code has to be customized to the inferred pandas-dtype for each collection. For example, when an in-memory run is to be evaluated with qrels loaded from disk, one has to make sure that the custom run.run_data matches the dtypes of the qrels, which may change based on the collection used. In my case, I assumed id fields to be string-based and ran into a lot of merge-on-different-dtypes errors.

Therefore, I would argue for enforcing a consistent dtype (str) for the doc_id and query columns when loading runs/qrels from disk. This would not affect the file-based workflow in any way, but it would greatly improve the in-memory experience and usability by providing a consistent type mapping across all collections.
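
A sketch of the proposed change for the run loader (the qrel loader would get the same treatment):

import pandas as pd

# Force the identifier columns to str so dtypes no longer depend on the
# collection; rank and score keep their numeric types.
run_data = pd.read_csv(
    "run.txt",  # placeholder file name
    sep=r"\s+",
    names=["query", "q0", "docid", "rank", "score", "system"],
    dtype={"query": str, "docid": str},
)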

Add ability to compute trec_eval metrics directly on in-memory data structures

From your paper:

[screenshot of the relevant passage from the paper]

This is now pretty easy... with pyserini on PyPI.

But the real point of this issue is this: currently, as I understand it, the input to evaluation is a file. Can we make it so that we can compute evaluation metrics directly from in-memory data structures?

The question is: what should the in-memory data structures look like? A pandas DataFrame with the standard TREC output format columns? A dictionary supporting random access by qid? Something else?
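
One possible shape, sketched under the assumption that TrecRun and TrecQrel can be instantiated empty and have a DataFrame assigned to run_data / qrels_data (this assignment pattern is not a documented API):

import pandas as pd
from trectools import TrecRun, TrecQrel, TrecEval

run = TrecRun()
run.run_data = pd.DataFrame(
    [("q1", "Q0", "d1", 1, 2.5, "myrun"), ("q1", "Q0", "d2", 2, 1.3, "myrun")],
    columns=["query", "q0", "docid", "rank", "score", "system"],
)

qrel = TrecQrel()
qrel.qrels_data = pd.DataFrame(
    [("q1", "0", "d1", 1), ("q1", "0", "d2", 0)],
    columns=["query", "q0", "docid", "rel"],
)

print(TrecEval(run, qrel).get_map())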

If we can converge on something, I can even try to volunteer some of my students to contribute to this effort... :)

What is Robust03?

Could you clarify which one is Robust03?
I only know Robust04 (disk4-5 without cr folder) and Robust05 (AQUAINT)

Malformed lines

Hi @joaopalotti,
It seems that for malformed lines, such as:

[screenshot of a malformed run file]

trec_eval throws:

trec_eval.get_results: Malformed line 790

while trec_tools does not.

I am not sure that this is a bad thing, but perhaps a warning would suit here?
Unfortunately I am a bit swamped lately so I am not available to offer a fix myself.
Thank you!
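
A sketch of the kind of check that could emit such a warning while reading a run file (standalone, not the library's parser; the file name is a placeholder):

import warnings

EXPECTED_COLUMNS = 6  # query, Q0, docid, rank, score, runtag

with open("run.txt") as f:
    for lineno, line in enumerate(f, start=1):
        if line.strip() and len(line.split()) != EXPECTED_COLUMNS:
            warnings.warn(f"Malformed line {lineno}: {line.rstrip()}")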

Question about how to fuse multiple ranking results (please help)

Great library, thanks for all the excellent work!

I'm trying to fuse 5 CSV files generated by 5 ranking models. After reading the documentation (example 6) I still don't know how to achieve it; could you please help me with it?

For example, file model_1.csv:

search_id,item_id
1,45
1,3
1,4
1,90
2,5
2,54
2,76

and file model_2.csv:

search_id,item_id
1,45
1,4
1,3
1,78
2,5
2,93
2,54

Note: different models may return a different set of item_id values for an individual search (e.g., item_id 90 appears in model_1 for search_id 1, but not in model_2 for search_id 1). Every model also has a validation score (NDCG); does that score help? How should I use it (as a weight, maybe)?

Could you provide some example code showing how I can achieve this (how to read the CSV files as TrecRun objects and fuse them, using the validation score as a weight if possible)? And what kind of fusion is appropriate for this kind of task? (It's about fusing the rankings returned by a hotel search.)
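
A sketch of one way to do this, assuming a TrecRun can be populated by assigning a DataFrame with the standard run columns to run_data; plain reciprocal rank fusion ignores the validation scores, but a weighted variant could scale each run's contribution by its NDCG:

import pandas as pd
from trectools import TrecRun, fusion

def csv_to_run(path, tag):
    # Turn a (search_id, item_id) CSV into the standard TREC run columns;
    # rank follows file order and the score is a simple decreasing value.
    df = pd.read_csv(path, dtype=str)
    df = df.rename(columns={"search_id": "query", "item_id": "docid"})
    df["q0"] = "Q0"
    df["rank"] = df.groupby("query").cumcount() + 1
    df["score"] = 1.0 / df["rank"]
    df["system"] = tag
    run = TrecRun()
    run.run_data = df[["query", "q0", "docid", "rank", "score", "system"]]
    return run

runs = [csv_to_run("model_1.csv", "m1"), csv_to_run("model_2.csv", "m2")]
fused = fusion.reciprocal_rank_fusion(runs)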

Question on TrecQrel relevance

In the README the relevance in the TrecQrel format is described as "how relevant is docno for qid", and the example below it has the values 0, 1 and 2. I have three questions about this:

  1. Is 0 the most relevant docno and 2 the least relevant or vice versa?
  2. Can I put any non-negative integer as the relevance?
  3. Are float numbers supported?

For context, for each query I have a list of documents and a count of how many times each document was opened by a user (hits) after the query. Now I'm not sure whether I can use the hits directly as the relevance number, or whether I should scale them, or just order the documents by hits and use the index.
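
Purely as an illustration of one option, a sketch that buckets raw hit counts into small integer grades (the thresholds are arbitrary assumptions, not a recommendation from the library):

def hits_to_relevance(hits):
    # Map click counts to graded relevance: 0 = not opened, 1 = opened a few
    # times, 2 = opened often.
    if hits == 0:
        return 0
    return 1 if hits < 5 else 2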

Recall at depth K missing

The Evaluation Measures part of the readme mentions that the project has an implementation of recall at depth K, but I couldn't find one.

Command not found: trec_eval

It seems the NIST trec_eval executable is not installed through pip install trectools

    656             except OSError as e:
    657                 if e.errno == errno.ENOENT:
--> 658                     raise ValueError('Command not found: %s' % self.args[0])
    659             except Exception as e:  #pragma: no cover
    660                 logger.exception('Popen call failed: %s: %s', type(e), e)

ValueError: Command not found: trec_eval

for qrel, 0 or Q0?

In the main readme, it says the qrels format should be "qid 0 docid rel," but in trec_qrel:
def read_qrel(self, filename, qrels_header=None):
    # Replace with default argument for qrel_header
    if qrels_header is None:
        qrels_header = ["query", "q0", "docid", "rel"]
so it seems like it should be Q0 instead of 0. Should it be Q0, or does it perhaps not matter? Thanks.
