ntunlplab Goto Github PK

repos: 48 gists: 0

Name: NTU NLP Lab

Type: Organization

Bio: Natural Language Processing Laboratory, National Taiwan University

Location: Taiwan

Blog: http://nlg.csie.ntu.edu.tw

NTU NLP Lab's Projects

amdrd

Analysis Model of Discourse Relations within a Document (AMDRD)

c2rc2

Categorizing Citation Relations in Scientific Papers Based on the Contributions of Cited Papers

chinese-word-ordering-errors-detection-and-correction-corpus

Word Ordering Errors (WOEs) are the most frequent type of grammatical errors at sentence level for non-native Chinese language learners. Learners taking Chinese as a foreign language often place character(s) in the wrong places in sentences, and that results in wrong word(s) or ungrammatical sentences. Besides, there are no clear word boundaries in Chinese sentences.

citation-intent-classification-evidence-extraction-

Citation Intent Classification and Its Supporting Evidence Extraction for Citation Graph Construction

contrastive-debate-representation

Contrastively learning participant representations per round in thread-based debates.

contributionsum

The ContributionSum Dataset

convlogrecaller-dataset

ConvLogRecaller-dataset

dialogue-mpdd

A dialogue dataset is an indispensable resource for building a dialogue system. Additional information such as emotions and interpersonal relationships labeled on conversations enables the system to capture the emotion flow of the participants in the dialogue. However, there is no publicly available Chinese dialogue dataset with emotion and relation labels. In this paper, we collect conversations from TV series scripts and annotate each utterance with emotion and interpersonal relationship labels. The dataset contains 25,548 utterances from 4,142 dialogues. We also set up experiments to observe the effect of the responded utterance on the current utterance, and the correlation between emotion and relation types in emotion and relation classification tasks.

discourse-chinese-discourse-parser-acl2020

code

discourse-ntu-chinese-discourse-resources

fecs

finance-fin-some

Fin-SoMe is a dataset with 10,000 labeled financial tweets annotated by experts from both the front desk and the middle desk in a bank's treasury. These annotated results reveal that (1) writer-labeled market sentiment may be a misleading label; (2) writer's sentiment and market sentiment of an investor may be different; (3) most financial tweets provide unfounded analysis results; and (4) almost no investors write down the gain/loss results for their positions.

finance-finnum

Numerals are a crucial part of financial documents. To understand the details of opinions in financial documents, we need to analyze not only the text but also the numeric information in depth. Because of the informal writing style, analyzing social media data is more challenging than analyzing news and official documents. FinNum is a dataset for fine-grained numeral understanding in financial social media data: the task is to identify the category of a numeral.

finance-finprolex

FinProLex provides 5,162 tokens in professional analysts' reports and the financial social media platform posts with expert-like scores. The expert-like scores are calculated based on the pointwise mutual information (PMI).

finance-icrd

There are two tasks in the ICRD. The dataset is split into three parts: Train, Dev, and Test.

(1) Premise Detection. In the premise detection task, we aim to identify whether the given sentence is a premise. There are two keys for each instance. "sentence" is the given sentence. If the value of "ans" is 0, the given sentence is not a premise. If the value of "ans" is 1, the given sentence is a premise.

(2) Claim-Premise Inference. Given a claim and a sentence, models are asked to predict whether the given sentence is a premise of the claim. There are three keys for each instance. "claim" is the given claim and "compare_sent" is the other given sentence. If the value of "ans" is 0, the given sentence is not a premise of the given claim. If the value of "ans" is 1, the given sentence is a premise of the given claim.
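The two instance layouts described above can be sketched in Python as follows. The key names "sentence", "claim", "compare_sent", and "ans" come from the description; the dict representation, the helper `task_of`, the task labels it returns, and the example sentences are assumptions made for illustration and are not taken from the dataset.

```python
from typing import Any, Dict

# Key sets named in the ICRD description; storing instances as dicts is an assumption.
PREMISE_KEYS = {"sentence", "ans"}
INFERENCE_KEYS = {"claim", "compare_sent", "ans"}


def task_of(instance: Dict[str, Any]) -> str:
    """Return which ICRD task an instance appears to belong to, based on its keys."""
    if instance.get("ans") not in (0, 1):
        raise ValueError("'ans' must be 0 or 1")
    keys = set(instance)
    if keys == PREMISE_KEYS:
        return "premise_detection"
    if keys == INFERENCE_KEYS:
        return "claim_premise_inference"
    raise ValueError(f"unexpected keys: {sorted(keys)}")


# Made-up examples, not drawn from the dataset.
premise_example = {"sentence": "Revenue grew for three straight quarters.", "ans": 1}
inference_example = {
    "claim": "The stock is undervalued.",
    "compare_sent": "The weather was sunny on Monday.",
    "ans": 0,
}

print(task_of(premise_example))
print(task_of(inference_example))
```

A check like this may be useful before training, since the two tasks share the "ans" key but differ in their input fields.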

finance-ntusd-fin

NTUSD-Fin provides various scoring methods for tokens, including frequency, CFIDF, chi-squared value, market sentiment score, and word vector. Only tokens that appear at least ten times and show a significant difference between expected and observed frequency under the chi-squared test are retained in the dictionary. The predetermined significance level is 0.05. The market sentiment score is calculated by subtracting the bearish PMI from the bullish PMI. The constructed dictionary, NTUSD-Fin, contains 8,331 words, 112 hashtags, and 115 emojis.
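As a rough illustration of the market sentiment score, the sketch below computes a bullish PMI and a bearish PMI from post counts and subtracts one from the other. The description only states that the score is the bullish PMI minus the bearish PMI; the standard PMI definition, the base-2 logarithm, the count-based probability estimates, and the function names are assumptions here, not the lab's implementation.

```python
import math


def pmi(joint: int, word: int, label: int, total: int) -> float:
    """PMI(word, label) = log2( P(word, label) / (P(word) * P(label)) ).

    All arguments are post counts; probabilities are estimated as count / total.
    Assumes every count is positive (no smoothing is applied).
    """
    return math.log2((joint / total) / ((word / total) * (label / total)))


def market_sentiment_score(
    bull_with_word: int, bear_with_word: int, n_bull: int, n_bear: int
) -> float:
    """Bullish PMI minus bearish PMI for a single token (hypothetical helper)."""
    word_total = bull_with_word + bear_with_word
    total = n_bull + n_bear
    bullish = pmi(bull_with_word, word_total, n_bull, total)
    bearish = pmi(bear_with_word, word_total, n_bear, total)
    return bullish - bearish


# Toy counts: the token appears in 30 of 100 bullish posts and 10 of 100 bearish posts.
score = market_sentiment_score(30, 10, 100, 100)
print(round(score, 3))
```

Under these assumptions, a positive score means the token is more associated with bullish posts, a negative score with bearish posts, and a token that is equally frequent in both groups scores zero.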

finance-numattach

Numerals are a crucial part of financial documents. To understand the details of opinions in financial documents, we need to analyze not only the text but also the numeric information in depth. Because of the informal writing style, analyzing social media data is more challenging than analyzing news and official documents. NumAttach is a dataset for fine-grained numeral understanding in financial social media data: the task is to detect the relation between a cashtag and a numeral.

finance-numclaim

Numerals provide important information in financial narratives. Our statistics on financial analysis reports show that over 58.47% of sentences contain at least one numeral. Without the numerals, much of the fine-grained information in the analysis reports would be lost. This phenomenon evidences the importance of numerals in the financial narrative. Based on our observation, investors always make a claim with an estimation. This estimation can be a cue for detecting the investor's fine-grained claim. Therefore, we propose an expert-annotated dataset, NumClaim, for probing argument mining in the financial narrative. Among the 5,144 instances in the NumClaim dataset, 23.78% and 76.22% of instances containing numerals are annotated as "In-claim" and "Out-of-claim", respectively.

finance-numeracy-600k

Numerals are a crucial part of narratives, especially in financial documents. We need to analyze not only the text but also the numeric information in depth. Numeracy-600K is a dataset for testing the numeracy of machines.

framenet-cfn-lex

A total of 36K lexical units that cover 779 frames for FrameNet in Chinese. This resource is extracted from a large-scale bilingual corpus to achieve higher coverage in terms of lexical units, which is helpful in providing frame recommendations for annotation campaigns or constructing robust frame identification systems.

framenet-cfn-sp

This system is trained on a FrameNet subset of 31 frames ('Arriving', 'Accompaniment', 'Visiting', 'Discussion', 'Meet_with', 'Presence', 'Ingestion', 'Ride_vehicle', 'Perception_active', 'Sleep', 'Competition', 'Attending', 'Giving', 'Text_creation', 'Transitive_action', 'Resolve_problem', 'Statement', 'Receiving', 'Taking_time', 'Social_event', 'Departing', 'Deciding', 'Arranging', 'Waiting', 'Perception_experience', 'Contacting', 'Borrowing', 'Commerce_buy', 'Questioning', 'Activity', 'Inspecting') that cover daily events for lifelogging. The overall frame semantic parsing performance of our system is an F1 score of 97.12 on training and 85.14 on testing.

heterogeneous_argument_attention_network

icda

Interactive Clinical Diagnostic Assistant for Medical Interview

lifeeventdialog

Life Event Dialog contains fine-grained personal life event annotations on DailyDialog.

lifelog-dialog

Conversation, a common way for people to share their experiences and feelings with others, contains important information about the personal life events of individuals, but is rarely explored. In this dataset, we initiate the task of detecting personal life events from daily conversation. We extend a multi-turn dialog dataset, DailyDialog, with life event annotation. We collect 600 conversations with 4-6 utterances from 4 topics of DailyDialog. Our goal is to detect the life events of each speaker in real time.

lifelog-fpimgcaphrc

lifelog-livekb

People often forget things in daily life, so support for recalling information at the right time and place is an emerging need. Constructing a personal knowledge base for each individual is important for memory recall and living assistance applications. We collect tweets from 18 users who set their tweets as public and posted between 2009 and 2017. We aim to extract life events from tweets shared on Twitter and to construct personal knowledge bases of individuals.

lifelog-pkbqac-dataset

A Dataset for Personal Knowledge Base Question Answering and Unanswerable Question Correction
