microsoft / openkp

Automatically extracting keyphrases that capture a document's salient meaning is an essential step toward semantic document understanding. An effective keyphrase extraction (KPE) system can benefit a wide range of natural language processing and information retrieval tasks. Recent neural methods formulate KPE as a document-to-keyphrase sequence-to-sequence task, and these seq2seq models have shown promising results compared to previous KPE systems. However, the recent progress in neural KPE has mostly been observed on documents from the scientific domain. In real-world scenarios, most potential applications of KPE deal with diverse documents originating from sparse sources. These documents are unlikely to have the structure or the polished prose of scientific papers. They often have a much more diverse document structure and span various domains whose content targets a much wider audience than scientists. To encourage the research community to develop powerful neural models for keyphrase extraction on open domains, we have created OpenKP: a dataset of over 150,000 documents with the most relevant keyphrases generated by expert annotation.

Home Page: https://microsoft.github.io/OpenKP/

License: MIT License

Python 100.00%

openkp's Issues

Description on the format for <candidate file> <reference file> for evaluation

The evaluation script expects 2 files. Can someone explain the format and use case of each?

Error in precision@K calculation

OpenKP/evaluate.py

Line 49 in 1207c7c

precision.append(true_positive/float(i + 1))

Shouldn't this be:
precision.append(100*true_positive/min(float(i + 1), len(candidates)))
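To make the difference between the two formulas concrete, here is a minimal sketch that computes precision@K both ways. The function name `precision_at_k` and the `clip_to_candidates` flag are hypothetical and do not come from evaluate.py; the loop only mirrors the quoted line under the assumption that candidates are ranked and references form a set. Results are kept as fractions, so the factor of 100 from the proposal is left out.

```python
def precision_at_k(candidates, references, max_depth=5, clip_to_candidates=False):
    """Return [P@1, ..., P@max_depth] as fractions in [0, 1].

    Hypothetical helper for illustration; not the repository's function.
    """
    reference_set = set(references)
    true_positive = 0
    precision = []
    for i in range(max_depth):
        # Count a hit only if a candidate exists at this rank and is a reference.
        if i < len(candidates) and candidates[i] in reference_set:
            true_positive += 1
        denominator = i + 1
        if clip_to_candidates:
            # Variant proposed in the issue: never divide by more than the
            # number of candidates that were actually returned.
            denominator = min(i + 1, len(candidates))
        # Guard against an empty candidate list in the clipped variant.
        precision.append(true_positive / denominator if denominator else 0.0)
    return precision


candidates = ["deep learning", "keyphrase extraction"]
references = ["keyphrase extraction", "deep learning", "open domain"]

standard = precision_at_k(candidates, references, max_depth=3)
clipped = precision_at_k(candidates, references, max_depth=3,
                         clip_to_candidates=True)
```

With only two candidates, the third entry of `standard` divides by 3 while the third entry of `clipped` divides by 2, which is the whole disagreement. Note that the expression as proposed in the issue would divide by zero for an empty candidate list, which is why the sketch adds a guard.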

How can I use the OpenKP model for prediction?

Which of these scripts can be used for prediction, and what input file structure does it expect?
MakeOpenKP.py
makepredsform.py

This repo is missing important files

There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the pr is merged this issue will be closed automatically.

Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

Merge this pull request

Normalization of answers.

Hi, while running a case study with the evaluation script, we were confused by some of the results.

We found the normalize_answer function, which defines several rules for normalizing the candidates and references. However, these rules are never applied to the answers: the return statement does not call any of them. This causes at least a punctuation problem: the raw text field of the dataset contains no punctuation, but some of the keyphrases do, which makes it impossible for models to predict "exactly matched" keyphrases.
Refer to:

OpenKP/evaluate.py

Lines 7 to 17 in 6e1d770

    def normalize_answer(s):
        def remove_articles(text):
            return re.sub(r'\b(a|an|the)\b', ' ', text)
        def white_space_fix(text):
            return ' '.join(text.split())
        def remove_punc(text):
            exclude = set(string.punctuation)
            return ''.join(ch for ch in text if ch not in exclude)
        def lower(text):
            return text.lower()
        return ' '.join([lower(x) for x in s]).rstrip()
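For comparison, here is a minimal sketch of what the function might look like if every rule were applied. This is an assumption about the intended behavior, not the maintainers' code. The order of the rules (lowercase, strip punctuation, drop articles, collapse whitespace) is borrowed from the common SQuAD-style evaluation script, and whether dropping articles is desirable for keyphrases is an open question.

```python
import re
import string


def normalize_answer(tokens):
    """Normalize a keyphrase given as a list of tokens.

    Sketch only: applies all four helper rules, unlike the quoted version,
    which only lowercases and joins the tokens.
    """
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    text = ' '.join(tokens)
    # Lowercase first so the article pattern matches "The" as well as "the".
    return white_space_fix(remove_articles(remove_punc(lower(text))))


example = normalize_answer(["The", "Middle", "East", "&", "Jewish", "World"])
```

Here `example` loses both the leading article and the ampersand, so a model that never sees punctuation in the raw text could still match the reference exactly.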

We also noticed that you filter out keyphrases that begin with an empty string "". However, quite a few keyphrases in the dataset begin with an empty string without being entirely empty, such as ["", "Middle", "East", "&", "Jewish", "World"] on line 6109 of dev.jsonl. Is there a reason to discard these kinds of keyphrases?
Refer to:

OpenKP/evaluate.py

Lines 19 to 25 in 6e1d770

    def remove_empty(a_list):
        new_list = []
        for i in a_list:
            if len(i) > 0:
                if len(i[0]) >0:
                    new_list.append(normalize_answer(i))
        return new_list
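One possible alternative, sketched below, drops only the empty tokens and keeps the rest of the keyphrase. This is an illustration of the question raised above, not a confirmed fix. The `normalize` parameter and its default are hypothetical stand-ins so that the block runs on its own.

```python
def remove_empty(phrases, normalize=lambda tokens: ' '.join(tokens).lower()):
    """Drop empty tokens inside each keyphrase instead of the whole keyphrase.

    `phrases` is a list of keyphrases, each given as a list of tokens.
    The default `normalize` is a placeholder for the script's own normalizer.
    """
    cleaned = []
    for tokens in phrases:
        # Keep only tokens that contain something other than whitespace.
        kept = [token for token in tokens if token.strip()]
        if kept:
            cleaned.append(normalize(kept))
    return cleaned


phrases = [["", "Middle", "East", "&", "Jewish", "World"], [], [""]]
result = remove_empty(phrases)
```

In this sketch the keyphrase from dev.jsonl survives with its leading empty token removed, while the two genuinely empty entries are still filtered out.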

where to get the dataset

I want to get the OpenKP dataset, but I can't find it at https://microsoft.github.io/msmarco/ or msmarco.org. Has it been removed?

Code for VDOM

Can you please share your code for generating the VDOM?

Release testset

Hi MARCO team,

I appreciate your effort in creating this great dataset and maintaining the leaderboard. Since the leaderboard for OpenKP has been retired, I wonder whether there is any plan to release the test dataset so that the research community can continue studies with it.

Thank you,
Rui Meng

where is the data?

I am unable to find the data to download. Can you please point me to the location?

The website for dataset downloading seems to be down

Hi OpenKP Team,

I tried to visit http://www.msmarco.org/dataset.aspx, but it does not seem to be working at the moment.

Thanks,
Jiaying
