Comments (3)
Hi, thank you for sharing the data and code!
I just found that an input word is not correctly tokenized by the word tokenizer. In word_tokenizer.py, each word is converted directly to a token id:

    for raw_tokens in raw_tokens_list:
        indexed_tokens = self.tokenizer.convert_tokens_to_ids(tokens)

However, a word could first be tokenized into word pieces:

    for raw_tokens in raw_tokens_list:
        for word in raw_tokens:
            word_tokens = self.tokenizer.tokenize(word)

Converting words to token ids directly produces lots of [UNK] tokens and makes the performance drop a lot.
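A minimal, self-contained sketch of why the two paths differ (the toy vocabulary and helper names below are hypothetical, not the repository's code): greedy longest-match-first WordPiece splits an out-of-vocabulary word into known pieces, while a direct vocabulary lookup on the whole word falls back to [UNK].

```python
# Toy WordPiece vocabulary; real BERT vocabs have ~30k entries.
VOCAB = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2,
         "token": 3, "##ization": 4, "tok": 5, "##en": 6}

def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub        # continuation pieces carry the ## prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:                 # no piece matches: the whole word is unknown
            return ["[UNK]"]
        pieces.append(cur)
        start = end
    return pieces

def convert_tokens_to_ids(tokens, vocab):
    return [vocab.get(t, vocab["[UNK]"]) for t in tokens]

# Direct lookup: "tokenization" is not a full vocab entry -> [UNK]
print(convert_tokens_to_ids(["tokenization"], VOCAB))                  # [0]
# WordPiece first: the word survives as known pieces
print(convert_tokens_to_ids(wordpiece("tokenization", VOCAB), VOCAB))  # [3, 4]
```

With a real BERT vocabulary the effect is the same: most whole words are not vocabulary entries, so skipping the WordPiece step maps them all to the single [UNK] id and the encoder sees almost no lexical information.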
Thanks, we will fix it soon.
from few-nerd.
Hi. I have implemented the tokenization function. Performance for the prototypical network in the (inter) 5-way 5~10-shot setting jumped from 52.42 to 60.09; similarly, (inter) 5-way 1~2-shot performance jumped from 37.49 to 44.45. These are single runs only, but I think the tokenization is worth changing.
Here is my implementation:
def tokenize(self, raw_tokens, tags):
    # Note: requires numpy imported as np.
    raw_tokens = [token.lower() for token in raw_tokens]
    indexed_tokens_list = []
    tag_list = []
    mask_list = []
    text_mask_list = []
    curr_split = ['[CLS]']
    tag_split = []
    mask_split = np.zeros(self.max_length, dtype=np.int32)
    text_mask_split = np.zeros(self.max_length, dtype=np.int32)
    for word, tag in zip(raw_tokens, tags):
        tokens = self.tokenizer.tokenize(word)
        if len(curr_split) + len(tokens) >= self.max_length:
            # Current split is full: close it with [SEP], pad, and store it.
            indexed_tokens = self.tokenizer.convert_tokens_to_ids(curr_split + ['[SEP]'])
            seq_len = len(indexed_tokens)  # length before padding
            while len(indexed_tokens) < self.max_length:
                indexed_tokens.append(0)
            mask_split[:seq_len] = 1  # attention mask over real (unpadded) tokens
            indexed_tokens_list.append(indexed_tokens)
            tag_list.append(tag_split)
            mask_list.append(mask_split)
            text_mask_list.append(text_mask_split)
            curr_split = ['[CLS]']
            tag_split = []
            mask_split = np.zeros(self.max_length, dtype=np.int32)
            text_mask_split = np.zeros(self.max_length, dtype=np.int32)
        # Mark the first word piece of each word, so word-level tags can be
        # aligned with the subword sequence.
        text_mask_split[len(curr_split)] = 1
        curr_split.extend(tokens)
        tag_split.append(tag)
    if tag_split:
        # Flush the last (possibly partial) split.
        indexed_tokens = self.tokenizer.convert_tokens_to_ids(curr_split + ['[SEP]'])
        seq_len = len(indexed_tokens)
        while len(indexed_tokens) < self.max_length:
            indexed_tokens.append(0)
        mask_split[:seq_len] = 1
        indexed_tokens_list.append(indexed_tokens)
        tag_list.append(tag_split)
        mask_list.append(mask_split)
        text_mask_list.append(text_mask_split)
    return indexed_tokens_list, mask_list, text_mask_list, tag_list
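As a usage note, here is a small sketch (the toy shapes and variable names are assumptions, not the repository's code) of what the returned text_mask is for: it marks the first word piece of each original word, so one vector or label per word can be read off the padded subword sequence.

```python
import numpy as np

max_length = 8
# Fake encoder output: one 4-dim vector per subword position.
hidden = np.arange(max_length * 4, dtype=np.float32).reshape(max_length, 4)

# Suppose the split is: [CLS] tok ##eni ##zation city [SEP] <pad> <pad>
# The words "tokenization" and "city" start at positions 1 and 4.
text_mask = np.zeros(max_length, dtype=np.int32)
text_mask[[1, 4]] = 1

word_states = hidden[text_mask == 1]  # one vector per original word
print(word_states.shape)  # (2, 4)
```

Selecting first-subword positions this way keeps the word-level tag sequence and the selected vectors the same length, which is what the prototypical network needs.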
Thanks, this is very helpful. We will update the results soon.
Related Issues (20)
- Why does eval always run on the CPU? HOT 1
- Cannot ingest data into either train_demo or pre-processing HOT 1
- UnboundLocalError: local variable 'label' referenced before assignment HOT 2
- Hi, what are data_split.py and processing.py for? HOT 1
- Code and dataset license HOT 1
- Hi, is the annotation interface code for few-nerd mentioned in the paper publicly available? HOT 4
- Hi, is it right that adding --use_sampled_data to few-shot training makes no difference to the results and only simplifies the training process? HOT 1
- Hi, I got the following error when running bash run_supervised.sh; what could be the cause? HOT 2
- Hi, a question about a detail of the few-shot experiments. HOT 1
- Unable to download the episode-sampled dataset HOT 1
- Why does this dataset use the IO scheme? HOT 1
- Using custom dataset HOT 2
- The episode download link is broken
- Structshot
- Regarding Data-set in inter/intra folder HOT 4
- How to do inference on my custom data after training on FewNERD data? HOT 13
- What is the difference between episode-data model training and non-episode-data. HOT 5
- How to create few shot episode-data for training and test from the general custom NER data?
- How to use the model
- I've got ERROR 404:Not Found when I try to download through bash file HOT 2