ink-usc / triggerner
TriggerNER: Learning with Entity Triggers as Explanations for Named Entity Recognition (ACL 2020)
Home Page: https://arxiv.org/abs/2004.07493
I ran your new code. It seems that the percentage option still doesn't work in semi_supervised.py.
I ran python semi_supervised.py --device cuda:0 --dataset CONLL --percentage 20; the test F1 is 85.58.
I ran python semi_supervised.py --device cuda:1 --dataset CONLL --percentage 3; the test F1 is 86.39.
Hi!
Thanks for your code :)
I am trying to change the tagging schema from IOB to IO by implementing a simple function that replaces "B-" with "I-", but this change results in 0 precision, recall, and F1 on the dev and test sets.
This is the function I added to the Config class:
def use_io(self, insts: List[Instance]) -> None:
    """
    Use IO tagging schema to replace the IOB tagging schema in the instances
    :param insts:
    :return:
    """
    for inst in insts:
        output = inst.output
        for pos in range(len(inst)):
            curr_entity = output[pos]
            if curr_entity.startswith(self.B):
                output[pos] = curr_entity.replace(self.B, self.I)
I was wondering what causes this sharp drop in performance.
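One possible cause worth checking (an assumption on my side, not something confirmed in this thread): span-based evaluators often start an entity span only on a "B-" tag, so after converting everything to "I-" they extract zero spans, which yields exactly 0 precision/recall/F1. A minimal IO-aware span extractor, for comparison, would also start a span when the type changes:

```python
# Sketch: extract (start, end, type) spans from IO-tagged sequences.
# A B-only extractor would return no spans at all for this input,
# which would explain the all-zero metrics.
def io_spans(tags):
    spans, start, cur = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        typ = tag[2:] if tag.startswith("I-") else None
        if typ != cur:
            if cur is not None:
                spans.append((start, i - 1, cur))
            start, cur = i, typ
    return spans

# io_spans(["I-ORG", "I-ORG", "O", "I-PER"]) -> [(0, 1, "ORG"), (3, 3, "PER")]
```

If the evaluation code in this repo extracts spans that way, it would need a similar adjustment before IO tags can be scored.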
Hello,
I have a question regarding the trigger files. I do not completely understand the numeric ids found next to the words that are considered as triggers. For instance:
EU B-ORG
rejects T-3
German T-0
call T-4
to O
boycott T-1
British T-2
lamb T-2
. O
I understand that the words tagged with a T are triggers for the entity EU, so each entity has different triggers. However, what does the number next to the T mean? For a moment I thought the ids were the order in which tokens should be used. I have also thought that they might group the triggers. A colleague thought it was different levels of triggers. But I have seen that some examples do not contain a T-0, and in some cases the triggers are not numbered in any specific pattern.
So the meaning of the numbers is not completely clear to me.
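For what it's worth, one plausible reading (an assumption of mine, not confirmed by the authors in this thread) is that the number groups the tokens of one multi-token trigger phrase: all tokens tagged T-2 together form a single trigger, "British lamb". Under that assumption, the file above could be parsed like this:

```python
# Parse the two-column trigger format shown above, ASSUMING that T-k
# groups tokens into the k-th trigger phrase for the annotated entity.
from collections import defaultdict

def parse_triggers(lines):
    groups = defaultdict(list)  # trigger id -> tokens, in sentence order
    for line in lines:
        token, tag = line.split()
        if tag.startswith("T-"):
            groups[int(tag[2:])].append(token)
    return {k: " ".join(v) for k, v in groups.items()}

sample = [
    "EU B-ORG", "rejects T-3", "German T-0", "call T-4",
    "to O", "boycott T-1", "British T-2", "lamb T-2", ". O",
]
# parse_triggers(sample)[2] -> "British lamb"
```

That would also explain why the ids need not start at T-0 or follow a fixed order: they are just labels for distinct trigger phrases.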
Hi, thanks for sharing your source code.
I have a question about the experimental results. If you use 100% of the sentences labeled with both entity tags and trigger tags, what will the performance be?
1. Is an entity trigger essentially a matching pattern, so that whenever the pattern matches we can extract an entity? Can the pattern then be understood here as a kind of part-of-speech representation?
2. At its core, this method is still about strengthening a feature representation.
3. One concern: at prediction time, if the trigger words in the test samples differ greatly from the annotated trigger vocabulary, the method should be of little use. So it still comes down to how much the annotated triggers can cover; it is hard to believe that the Trigger Matching Networks can learn to discover what a sentence's triggers are and then make use of them.
How can I use the context_emb? I need your help.
I run: python semi_supervised.py --device cuda:0 --dataset CONLL --percentage 20
but the result only reaches 54.20 F1 on the test set. Which command should I run to achieve an F1 over 80?
Hello!
In soft_matcher.py, the index on line 140 should be the position of the trigger. If you convert the position information to a word (using idx2word[index]), this seems to be an error. What does trigger_list stand for?
Hi,
I have just read the paper, and it's a nice work.
I have some questions:
1. To enable more efficient batch-based training, does a sentence contain only one entity and one trigger? How much less efficient would it be without this limit?
2. g_s or g_t is obtained by a weighted sum of the token vectors, but there is no sum in the formula?
3. The base model is CNN-BLSTM-CRF, but there is no comparison in the results table? Is that BLSTM-CRF for short?
4. What would happen if the full amount of trigger data were used?
Best Wishes
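On question 2: if g_s or g_t is an attention-weighted sum of the token vectors h_1, ..., h_n, the usual form (my own reconstruction of generic attention pooling, not a quote from the paper) would be:

```latex
g = \sum_{i=1}^{n} \alpha_i \, h_i, \qquad
\alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{n} \exp(e_j)}, \qquad
e_i = u^\top \tanh(W h_i)
```

Here the softmax-normalized weights alpha_i are what make the sum a weighted one, even if the summation sign is compressed in the paper's notation.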
I have a question about an inconsistency between the paper and the code, in these snippets at commit c38df93:
- Lines 83 to 94
- TriggerNER/model/soft_inferencer.py, lines 19 to 29
- TriggerNER/model/soft_inferencer.py, lines 45 to 57
Hello! Thanks for your code! But I still have a question about trigger_list in soft_matcher.py, line 140:
trigger_list.extend([" ".join(self.config.idx2word[index] for index in indices if index != 0) for indices in word_seq])
word_seq represents the positions of the triggers in each sentence (i.e. each index is always smaller than seq_len). In that case, trigger_list will only include the triggers of the first several sentences, which may be an error.
Did I get it wrong, I wonder?
Thanks for your code again!
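If word_seq really holds trigger positions rather than word ids, one plausible fix (a sketch only; the batch/field names here are hypothetical, and I assume 0 is the padding position) is to map each position through the sentence's own word ids before looking the word up:

```python
# Sketch: position -> word id -> word, instead of idx2word[position].
def triggers_from_positions(position_batch, word_id_batch, idx2word, pad=0):
    trigger_list = []
    for positions, word_ids in zip(position_batch, word_id_batch):
        # pad (assumed to be 0) marks unused trigger slots
        words = [idx2word[word_ids[p]] for p in positions if p != pad]
        trigger_list.append(" ".join(words))
    return trigger_list
```

The maintainers would have to confirm which of the two interpretations of word_seq is correct.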
Hi,
I ran your code semi_supervised.py and found this error:
Traceback (most recent call last):
  File "semi_supervised.py", line 191, in <module>
    main()
  File "semi_supervised.py", line 188, in main
    sequence_trainer.self_training(20, dataset, unlabeled_x)
  File "/deepo_data/CZF/TriggerNER-master/model/soft_inferencer.py", line 219, in self_training
    weaklabel, unlabel = self.weak_label_selftrain(unlabels, self.triggers)
  File "/deepo_data/CZF/TriggerNER-master/model/soft_inferencer.py", line 283, in weak_label_selftrain
    weakly_labeled, unlabeled, confidence = self.weakly_labeling(batched_data, unlabeled_data, triggers)
  File "/deepo_data/CZF/TriggerNER-master/model/soft_inferencer.py", line 254, in weakly_labeling
    batch_max_scores, batch_max_ids = self.model.decode(*batch[0:5], triggers)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 576, in __getattr__
    type(self).__name__, name))
AttributeError: 'SoftSequence' object has no attribute 'decode'
I confirmed that the 'SoftSequence' object indeed has no 'decode' attribute. Is this a bug, or a problem with my setup? Thanks!
Thanks a lot.
Your method is amazing.
How do I design and create my own trigger txt file if I want to use my own Chinese corpus?
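Judging from the sample shown in an earlier issue, the trigger file is one token per line followed by its tag: entity tokens get BIO tags, trigger tokens get T-k (where k, as far as I can tell, numbers the trigger phrases; that grouping is my assumption), and everything else gets O. A small sketch that writes one sentence in that format, which should work for Chinese tokens just as well:

```python
# Sketch: emit one sentence in the two-column trigger format.
# entity_tags holds a BIO tag or None per token; trigger_ids holds a
# trigger-group number or None per token (grouping semantics assumed).
def format_trigger_sentence(tokens, entity_tags, trigger_ids):
    lines = []
    for tok, ent, trig in zip(tokens, entity_tags, trigger_ids):
        if ent is not None:
            tag = ent
        elif trig is not None:
            tag = f"T-{trig}"
        else:
            tag = "O"
        lines.append(f"{tok} {tag}")
    return "\n".join(lines)
```

For example, format_trigger_sentence(["EU", "rejects", "to"], ["B-ORG", None, None], [None, 3, None]) reproduces the first lines of the sample above.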
Hi,
I think your code may have some problems with the data split and the model.