jacklxc / corwa

CORWA: A Citation-Oriented Related Work Annotation Dataset, NAACL 2022

Python 41.93% Jupyter Notebook 58.07%

natural-language-processing related-work-generation scientific-documents citation-text-generation

corwa's Issues

Dataset cited paper in acl/pdf_parses.jsonl

I followed your instructions to extract the pdf_parses for ACL papers from the S2ORC dataset and obtained "20200705v1/acl/pdf_parses.jsonl". However, not all cited and citing papers in your dataset can be found in this file: for example, in the test set CORWA_test.jsonl, ids for both cited and citing papers are missing from acl/pdf_parses.jsonl. The LED generation process takes as input the abstract / introduction of both the cited and the citing paper, so if I understand correctly, I need to extract at least the abstract and introduction of every citing and cited paper in the test set from some pdf_parses file. Should I just scan through the entire "20200705v1/full/pdf_parses/*" to obtain this information? Thank you very much in advance!
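
For reference, the full scan described in this issue could look roughly like the sketch below. It is not the authors' extraction script. It assumes each shard is a JSONL file (optionally gzip-compressed) whose records carry a "paper_id" field, as in the public S2ORC release; `filter_parses`, `scan_shards`, and `wanted_ids` are hypothetical names, and how the ids are read out of CORWA_test.jsonl is left open because that file's layout is not shown here.

```python
import gzip
import json
from pathlib import Path


def filter_parses(lines, wanted_ids):
    """Return {paper_id: record} for every JSONL line whose id is wanted.

    Assumes each record has a "paper_id" field (an assumption based on the
    public S2ORC format); ids are compared as strings.
    """
    found = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        paper_id = str(record.get("paper_id"))
        if paper_id in wanted_ids:
            found[paper_id] = record
    return found


def scan_shards(shard_dir, wanted_ids):
    """Scan every *.jsonl / *.jsonl.gz shard in shard_dir for wanted_ids."""
    wanted_ids = {str(i) for i in wanted_ids}
    found = {}
    for shard in sorted(Path(shard_dir).glob("*.jsonl*")):
        opener = gzip.open if shard.suffix == ".gz" else open
        with opener(shard, "rt", encoding="utf-8") as handle:
            found.update(filter_parses(handle, wanted_ids - set(found)))
        if len(found) == len(wanted_ids):
            break  # every requested paper has been located
    return found
```

Calling `scan_shards("20200705v1/full/pdf_parses", ids)` would return the parses that exist; any id absent from the result has no PDF parse in the scanned shards.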

Confusion regarding citation span detection

Hello, first of all thanks for the nice dataset. One part of your paper that caught my attention is Section 3.1.2 Citation Span Detection, where you defined a citation span as "the span of text whose information is directly derived from a specific cited paper". To my understanding, the annotation protocol relevant to this section is like so:

If the cited paper is explained, then the annotators were to label the explanation within the citing paper. The explanation may be only part of a sentence or go across sentence boundaries.
If the cited paper is not explained, then the annotators were to label the citation mark for the cited paper.

Here's an example of what I think counts as the first case:

data/annotated_train/10011032.txt, Line 37: [BOS] Zhang and Clark (2008) proposed an incremental joint segmentation and POS tagging model, with an effective feature set for Chinese.
data/annotated_train/10011032.ann, Line 54: T56 Dominant 4577 4599 Zhang and Clark (2008)

I expected the annotation to be "incremental joint segmentation and POS tagging model, with an effective feature set for Chinese" instead. Did I misunderstand the annotation protocol? Looking forward to your response.
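
To inspect such spans programmatically, here is a minimal sketch for reading the .ann line quoted above. It assumes the files follow a brat-style standoff layout (id, label, start offset, end offset, covered text) with one contiguous span per line; the format is inferred from the example, not confirmed by the repository, and `Span` and `parse_ann_line` are hypothetical names.

```python
from typing import NamedTuple


class Span(NamedTuple):
    ann_id: str
    label: str
    start: int  # character offset into the paired .txt file (assumed)
    end: int    # exclusive end offset (assumed)
    text: str


def parse_ann_line(line):
    """Parse one text-bound annotation line with a single contiguous span."""
    # Split on any whitespace so both tab- and space-separated lines work;
    # the fifth field keeps the covered text intact, spaces included.
    ann_id, label, start, end, text = line.rstrip("\n").split(None, 4)
    return Span(ann_id, label, int(start), int(end), text)


span = parse_ann_line("T56 Dominant 4577 4599 Zhang and Clark (2008)")
# The offsets cover exactly the citation mark, not the explanation after it.
print(span.label, span.end - span.start, len(span.text))
```

Comparing `end - start` with the length of the covered text is a quick check that the offsets and the text agree; slicing the paired .txt file with the same offsets should reproduce the covered text if the layout assumption holds.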

Example of `related_work.jsonl` file

I am trying to run your model on my own dataset and was wondering if it would be possible for you to share your related_work.jsonl file, to use as a reference for the data structure.

Thanks!
