
few-nerd's Introduction

Few-NERD: Not Only a Few-shot NER Dataset

This is the source code of the ACL-IJCNLP 2021 paper: Few-NERD: A Few-shot Named Entity Recognition Dataset. Check out the website of Few-NERD.

************************************* Updates *************************************

  • 09/03/2022: We have added the training script for supervised training using BERT tagger. Run bash data/download.sh supervised to download the data, and then run bash run_supervised.sh.

  • 01/09/2021: We have updated the results of the supervised setting of Few-NERD on arXiv; thanks to PedroMLF for the help.

  • 19/08/2021: Important💥 To accompany the released episode data, we have updated the training script. Simply add --use_sampled_data when running train_demo.py to train and test on the released episode data.

  • 02/06/2021: To simplify training, we have released the data sampled by episode. Click here to download. The files are named {train/dev/test}_{N}_{K}.jsonl. We sampled 20,000, 1,000, and 5,000 episodes for train, dev, and test, respectively.

  • 26/05/2021: The current Few-NERD (SUP) is sentence-level. We will soon release Few-NERD (SUP) 1.1, which is paragraph-level and contains more contextual information.

  • 11/06/2021: We have fixed the word tokenization and will soon update the latest results. We sincerely thank tingtingma and Chandan Akiti.

Overview

Few-NERD is a large-scale, fine-grained, manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens. Three benchmark tasks are built on it: one supervised, Few-NERD (SUP), and two few-shot, Few-NERD (INTRA) and Few-NERD (INTER).

The full type schema of Few-NERD can be found on the project website.

Few-NERD is manually annotated based on context; for example, in the sentence "London is the fifth album by the British rock band…", the named entity London is labeled as Art-Music.

Requirements

Run the following command to install the dependencies:

pip install -r requirements.txt

Few-NERD Dataset

Get the Data

  • Few-NERD contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities and 4,601,223 tokens.
  • We split the data into 3 training modes: one for the supervised setting (supervised) and two for the few-shot settings (inter and intra). Each contains three files: train.txt, dev.txt, and test.txt. The supervised datasets are split randomly. The inter datasets are split randomly within each coarse-grained type, i.e., each file contains all 8 coarse-grained types but different fine-grained types. The intra datasets are split randomly by coarse-grained type.
  • The split datasets can be downloaded automatically once you run the model. If you want to download the data manually, run data/download.sh, remembering to add the parameter supervised/inter/intra to indicate the type of the dataset.

To obtain the three benchmark datasets of Few-NERD, simply run the bash file data/download.sh with the parameter supervised/inter/intra as below:

bash data/download.sh supervised

To get the data sampled by episode, run

bash data/download.sh episode-data
unzip -d data/ data/episode-data.zip
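
Each line of an episode file is a self-contained JSON object with support, query, and types fields (an example episode appears in the issues section below). A minimal loading sketch, with an illustrative helper name that is not part of the repo:

import json

def load_episodes(path):
    # read one JSON episode per line of a {train/dev/test}_{N}_{K}.jsonl file
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

episodes = load_episodes("data/episode-data/inter/train_5_5.jsonl")
# episode["support"]["word"] and episode["support"]["label"] are parallel lists
# of token and tag sequences; "query" has the same layout.
print(len(episodes), episodes[0]["types"])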

Data Format

The data are pre-processed into the typical NER data format, as below (token\tlabel).

Between	O
1789	O
and	O
1793	O
he	O
sat	O
on	O
a	O
committee	O
reviewing	O
the	O
administrative	MISC-law
constitution	MISC-law
of	MISC-law
Galicia	MISC-law
to	O
little	O
effect	O
.	O
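
Sentences are separated by blank lines. A minimal parsing sketch for this format (the reader function is illustrative, not part of the repo):

def read_sentences(path):
    # parse token<TAB>label lines into (tokens, labels) pairs, one per sentence
    sentences, tokens, labels = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # a blank line ends the current sentence
                if tokens:
                    sentences.append((tokens, labels))
                    tokens, labels = [], []
                continue
            token, label = line.split("\t")
            tokens.append(token)
            labels.append(label)
    if tokens:  # handle a file without a trailing blank line
        sentences.append((tokens, labels))
    return sentences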

Structure

The structure of our project is:

-- util
|  -- framework.py
|  -- data_loader.py
|  -- viterbi.py            # Viterbi decoder for StructShot only
|  -- word_encoder.py
|  -- fewshotsampler.py

-- model
|  -- proto.py              # prototypical model
|  -- nnshot.py             # NNShot model

-- train_demo.py            # main training script

Key Implementations

Sampler

As established in our paper, we design an N-way K~2K-shot sampling strategy in our work; the implementation is at util/fewshotsampler.py.
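
Roughly, the strategy greedily adds sentences until each of the N sampled classes has at least K entity mentions, rejecting sentences that contain out-of-episode entity types or that would push any class past 2K. A simplified sketch of the idea (not the exact repo implementation; see util/fewshotsampler.py for the real one):

import random
from collections import Counter

def sample_episode(sentences, classes, N, K):
    # greedy N-way K~2K sampling sketch: each target class ends up with
    # between K and 2K labeled tokens in the sampled sentence set
    target = set(random.sample(classes, N))
    counts, picked = Counter(), []
    pool = list(sentences)
    random.shuffle(pool)
    for tokens, labels in pool:
        entity_labels = [l for l in labels if l != "O"]
        if any(l not in target for l in entity_labels):
            continue  # reject: contains an entity type outside the episode
        add = Counter(entity_labels)
        if any(counts[c] + n > 2 * K for c, n in add.items()):
            continue  # reject: would push a class past the 2K ceiling
        picked.append((tokens, labels))
        counts.update(add)
        if all(counts[c] >= K for c in target):
            break  # every class has reached K mentions
    return target, picked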

ProtoBERT

Prototypical networks with BERT are implemented in model/proto.py.
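
The core idea: average the BERT embeddings of each class's support tokens to form prototypes, then give each query token the label of its nearest prototype. A minimal PyTorch sketch (tensor names are illustrative; the --dot flag corresponds to the dot-product variant):

import torch

def proto_classify(support_emb, support_tags, query_emb, dot=False):
    # support_emb: [S, H] support token embeddings, support_tags: [S] tag ids,
    # query_emb: [Q, H]; returns one predicted tag id per query token
    tags = torch.unique(support_tags)
    # one prototype per class: the mean embedding of its support tokens
    protos = torch.stack([support_emb[support_tags == t].mean(0) for t in tags])
    if dot:
        scores = query_emb @ protos.T  # higher dot product = closer
    else:
        scores = -torch.cdist(query_emb, protos)  # negative L2 distance
    return tags[scores.argmax(dim=-1)]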

NNShot & StructShot

NNShot with BERT is implemented in model/nnshot.py.

StructShot is realized by adding an extra Viterbi decoder in util/framework.py.

Note that the backbone BERT encoder we used for the StructShot model is not pre-trained on the NER task.
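
NNShot differs from ProtoBERT only in the metric: each query token takes the tag of its single nearest support token rather than the nearest class mean, and StructShot additionally re-scores these emissions with the Viterbi decoder using the abstract transition probabilities (re-normalized by --tau). A minimal sketch of the nearest-neighbor step (illustrative, not the repo's exact code):

import torch

def nnshot_classify(support_emb, support_tags, query_emb):
    # tag each query token with the label of its nearest support token
    dists = torch.cdist(query_emb, support_emb)  # [Q, S] pairwise L2 distances
    nearest = dists.argmin(dim=-1)  # index of the closest support token
    return support_tags[nearest]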

How to Run

Run train_demo.py. The arguments are presented below. The default parameters are for the proto model on the inter mode dataset.

-- mode                 training mode, must be inter, intra, or supervised
-- trainN               N in train
-- N                    N in val and test
-- K                    K shot
-- Q                    Num of query per class
-- batch_size           batch size
-- train_iter           num of iters in training
-- val_iter             num of iters in validation
-- test_iter            num of iters in testing
-- val_step             validate every x training iters
-- model                model name, must be proto, nnshot or structshot
-- max_length           max length of tokenized sentence
-- lr                   learning rate
-- weight_decay         weight decay
-- grad_iter            accumulate gradient every x iterations
-- load_ckpt            path to load model
-- save_ckpt            path to save model
-- fp16                 use nvidia apex fp16
-- only_test            no training process, only test
-- ckpt_name            checkpoint name
-- seed                 random seed
-- pretrain_ckpt        bert pre-trained checkpoint
-- dot                  use dot instead of L2 distance in distance calculation
-- use_sgd_for_bert     use SGD instead of AdamW for BERT.
# only for structshot
-- tau                  StructShot parameter to re-normalize the transition probabilities
  • For the hyperparameter --tau in StructShot, we use 0.32 in the 1-shot setting, 0.318 in the 5-way-5-shot setting, and 0.434 in the 10-way-5-shot setting.

  • Take the StructShot model on the inter dataset for example; the experiments can be run as follows.

5-way-1~5-shot

python3 train_demo.py  --mode inter \
--lr 1e-4 --batch_size 8 --trainN 5 --N 5 --K 1 --Q 1 \
--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \
--max_length 64 --model structshot --tau 0.32

5-way-5~10-shot

python3 train_demo.py  --mode inter \
--lr 1e-4 --batch_size 1 --trainN 5 --N 5 --K 5 --Q 5 \
--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \
--max_length 32 --model structshot --tau 0.318

10-way-1~5-shot

python3 train_demo.py  --mode inter \
--lr 1e-4 --batch_size 4 --trainN 10 --N 10 --K 1 --Q 1 \
--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \
--max_length 64 --model structshot --tau 0.32

10-way-5~10-shot

python3 train_demo.py  --mode inter \
--lr 1e-4 --batch_size 1 --trainN 10 --N 10 --K 5 --Q 1 \
--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \
--max_length 32 --model structshot --tau 0.434

Citation

If you use Few-NERD in your work, please cite our paper:

@inproceedings{ding-etal-2021-nerd,
    title = "Few-{NERD}: A Few-shot Named Entity Recognition Dataset",
    author = "Ding, Ning  and
      Xu, Guangwei  and
      Chen, Yulin  and
      Wang, Xiaobin  and
      Han, Xu  and
      Xie, Pengjun  and
      Zheng, Haitao  and
      Liu, Zhiyuan",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.248",
    doi = "10.18653/v1/2021.acl-long.248",
    pages = "3198--3213",
}

License

Few-NERD dataset is distributed under the CC BY-SA 4.0 license. The code is distributed under the Apache 2.0 license.

Connection

If you have any questions, feel free to contact us.

few-nerd's Issues

Regarding the dataset in the inter/intra folders

Inside data/episode-data/inter/ I see a lot of training, test, and dev data. I may be asking a few silly questions; please pardon me.

I was exploring train_5_5.jsonl. What does train_5_5.jsonl signify? Does it have anything to do with the support and query sets?

Here is one example:

I see the support has 14 sentences and the query has 15 sentences.

So, in the example below, the support and query are fed into the model as a single example? The support is used to train the model? Then why is the query used? I am seeing this training data structure for the first time. Could you give me more insight, in layman's terms, into how training happens?

{
  "support":
          {"word":
                  [
                    ["averostra", ",", "or", "``", "bird", "snouts", "''", ",", "is", "a", "clade", "that", "includes", "most", "theropod", "dinosaurs", "that", "have", "a", "promaxillary", "fenestra", "(", "``", "fenestra", "promaxillaris", "``", ")", ",", "an", "extra", "opening", "in", "the", "front", "outer", "side", "of", "the", "maxilla", ",", "the", "bone", "that", "makes", "up", "the", "upper", "jaw", "."],
                    ["since", "that", "time", ",", "the", "squadron", "made", "several", "extended", "indian", "ocean", ",", "mediterranean", "sea", ",", "and", "north", "atlantic", "deployments", "as", "part", "of", "cvw-1", "/", "cv-66", ",", "until", "the", "decommissioning", "of", "uss", "``", "america", "''", "in", "1996", "."],
                    ["the", "alpha-gal", "allergy", "is", "believed", "to", "result", "from", "tick", "bites", "."],
                    ["interaction", "was", "shown", "to", "occur", "with", "the", "dna", "-directed", "rna", "polymerase", "ii", "subunit", ",", "rpb1", ",", "of", "rna", "polymerase", "ii", "during", "both", "mitosis", "and", "interphase", "."],
                    ["he", "is", "also", "responsible", "for", "programming", "on", "diablo", "ii", ",", "the", "development", "of", "the", "battle.net", "game", "server", "network", ",", "and", "the", "quake", "2", "mod", "loki", "'s", "minions", "capture", "the", "flag", "."],
                    ["minix", "was", "first", "released", "in", "1987", ",", "with", "its", "complete", "source", "code", "made", "available", "to", "universities", "for", "study", "in", "courses", "and", "research", "."],
                    ["terminal", "island", "is", "a", "low", "snow-covered", "island", "off", "the", "north", "tip", "of", "alexander", "island", ",", "in", "the", "bellingshausen", "sea", "west", "of", "palmer", "land", ",", "antarctic", "peninsula", "."],
                    ["among", "these", "were", "net/one", ",", "3+", ",", "banyan", "vines", "and", "novell", "'s", "ipx", "/", "spx", "."],
                    ["in", "1933\u20131970", ",", "a", "summer", "camp", "on", "south", "bass", "island", "operated", "for", "episcopal", "and", "anglican", "choristers", "."],
                    ["she", "is", "also", "the", "only", "cam", "ship", "whose", "fighter", "pilot", "died", "in", "action", "after", "his", "aircraft", "was", "launched", "from", "the", "ship", "."],
                    ["the", "department", "of", "social", "welfare", "and", "development", "(", "dswd", ")", "has", "distributed", "relief", "goods", "to", "residents", "of", "boracay", "while", "the", "island", "is", "closed", "to", "tourists", "."],
                    ["``", "rainbow", "``", "was", "scrapped", "in", "1940", "."],
                    ["it", "is", "the", "leading", "firm", "for", "the", "charlotte", "douglas", "international", "airport", "airfield", "expansion", ",", "the", "new", "dallas", "fort", "worth", "international", "airport", "southwest", "end-around", "taxiway", ",", "and", "master", "plan", "updates", "at", "philadelphia", "international", "airport", "and", "san", "antonio", "international", "airport", "."],
                    ["the", "event", "held", "at", "solberg-hunterdon", "airport", "is", "the", "largest", "summertime", "hot", "air", "balloon", "festival", "in", "north", "america", "."]
                  ],

          "label":
                  [
                    ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "other-biologything", "other-biologything", "O", "O", "other-biologything", "other-biologything", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "other-biologything", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "product-ship", "O", "product-ship", "O", "O", "O", "O", "O", "product-ship", "product-ship", "product-ship", "product-ship", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "O", "other-biologything", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "other-biologything", "O", "other-biologything", "other-biologything", "other-biologything", "O", "O", "other-biologything", "O", "O", "other-biologything", "other-biologything", "other-biologything", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "product-software", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["product-software", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["location-island", "location-island", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "location-island", "location-island", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "location-island", "location-island", "O"],
                    ["O", "O", "O", "product-software", "O", "product-software", "O", "product-software", "product-software", "O", "product-software", "product-software", "product-software", "O", "product-software", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "location-island", "location-island", "location-island", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "product-ship", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "location-island", "O", "O", "O", "O", "O", "O", "O", "O"],
                    ["O", "product-ship", "O", "O", "O", "O", "O", "O"],
                    ["O", "O", "O", "O", "O", "O", "O", "building-airport", "building-airport", "building-airport", "building-airport", "O", "O", "O", "O", "O", "building-airport", "building-airport", "building-airport", "building-airport", "building-airport", "O", "O", "O", "O", "O", "O", "O", "O", "O", "building-airport", "building-airport", "building-airport", "O", "building-airport", "building-airport", "building-airport", "building-airport", "O"],
                    ["O", "O", "O", "O", "building-airport", "building-airport", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]
                  ]
              },

  "query":
          {"word":
                  [
                    ["the", "final", "significant", "change", "in", "the", "life", "of", "the", "coco", "2", "(", "models", "26-3134b", ",", "26-3136b", ",", "and", "26-3127b", ";", "16", "kb", "standard", ",", "16", "kb", "extended", ",", "and", "64", "kb", "extended", "respectively", ")", "was", "to", "use", "the", "enhanced", "vdg", ",", "the", "mc6847t1", ",", "allowing", "lowercase", "characters", "and", "changing", "the", "text", "screen", "border", "color", "."],
                    ["the", "reno-tahoe", "international", "airport", "reno-tahoe", "international", "airport", "(", "formerly", "known", "as", "the", "reno", "cannon", "international", "airport", ")", "is", "the", "other", "major", "airport", "in", "the", "state", "."],
                    ["it", "was", "built", "by", "cole", "palen", "for", "flight", "in", "his", "weekend", "airshows", "as", "early", "as", "1967", "and", "actively", "flown", "(", "mostly", "by", "cole", "palen", ")", "within", "the", "weekend", "airshows", "at", "old", "rhinebeck", "until", "the", "late", "1980s", "."],
                    ["lambert", "land", "is", "bounded", "in", "the", "north", "by", "the", "nioghalvfjerd", "fjord", ",", "in", "the", "east", "by", "the", "greenland", "sea", "and", "in", "the", "south", "by", "the", "zachariae", "isstrom", "."],
                    ["started", "police", "operations", "with", "4", "cessna", "cu", "206g", "officially", "on", "7", "april", "1980", "with", "operations", "focused", "in", "peninsula", "of", "malaysi", "a", "."],
                    ["mysore", "airport", "is", "away", ",", "followed", "by", "kozhikode", "international", "airport", "at", "and", "bengaluru", "international", "airport", "at", "."],
                    ["the", "egg-shaped", "qaqaarissorsuaq", "island", "is", "located", "in", "tasiusaq", "bay", ",", "in", "the", "central", "part", "of", "upernavik", "archipelago", "."],
                    ["where", "they", "inserted", "nife", "hydrogenase", "into", "polypyrrole", "films", "and", "to", "provide", "proper", "contact", "to", "the", "electrode", ",", "there", "were", "redox", "mediators", "entrapped", "into", "the", "film", "."],
                    ["the", "nt-3", "protein", "is", "found", "within", "the", "thymus", ",", "spleen", ",", "intestinal", "epithelium", "but", "its", "role", "in", "the", "function", "of", "each", "organ", "is", "still", "unknown", "."],
                    ["ted", "insists", "that", "he", "will", "have", "a", "better", "chance", "at", "winning", "since", "the", "guest", "judge", ",", "tv", "presenter", "henry", "sellers", ",", "is", "staying", "at", "the", "craggy", "island", "parochial", "house", "."],
                    ["mdm2", "binds", "and", "ubiquitinates", "p53", ",", "facilitating", "it", "for", "degradation", "."],
                    ["neuraminidase", "inhibitors", "for", "human", "neuraminidase", "(", "hneu", ")", "have", "the", "potential", "to", "be", "useful", "drugs", "as", "the", "enzyme", "plays", "a", "role", "in", "several", "signaling", "pathways", "in", "cells", "and", "is", "implicated", "in", "diseases", "such", "as", "diabetes", "and", "cancer", "."], ["at", "it", "was", "long", "enough", "to", "accommodate", "the", "belle", "steamers", "that", "carried", "trippers", "along", "the", "coast", "at", "that", "time", "."],
                    ["these", "guerrilla", "sub", "missions", "originated", "at", "brisbane", "'s", ",", "capricorn", "wharf", "or", "mios", "woendi", "."],
                    ["because", "it", "was", "originally", "an", "island", "well", "within", "lake", "texcoco", ",", "iztacalco", "was", "settled", "by", "humans", "later", "than", "the", "rest", "of", "the", "valley", "of", "mexico", "."],
                    ["the", "nordic", "countries", "had", "developed", "the", "skerry", "cruiser", "classes", "and", "the", "international", "rule", "classes", "had", "adopted", "in", "1919", "a", "new", "edition", "of", "the", "rule", "which", "was", "not", "yet", "implemented", "in", "the", "countries", "."]
                  ],

          "label": [["O", "O", "O", "O", "O", "O", "O", "O", "O", "product-software", "product-software", "O", "O", "product-software", "O", "product-software", "O", "O", "product-software", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "product-software", "O", "O", "product-software", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], ["building-airport", "building-airport", "building-airport", "building-airport", "building-airport", "building-airport", "building-airport", "O", "O", "O", "O", "building-airport", "building-airport", "building-airport", "building-airport", "building-airport", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "building-airport", "building-airport", "O", "O", "O", "O", "O"], ["location-island", "location-island", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "location-island", "location-island", "location-island", "O", "O"], ["building-airport", "building-airport", "O", "O", "O", "O", "O", "building-airport", "building-airport", "building-airport", "O", "O", "building-airport", "building-airport", "building-airport", "O", "O"], ["O", "O", "location-island", "location-island", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], ["O", "O", "O", "other-biologything", "other-biologything", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], ["O", "other-biologything", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "location-island", "location-island", "O", "O", "O"], ["other-biologything", "O", "O", "O", "other-biologything", "O", "O", "O", "O", "O", "O"], ["other-biologything", "other-biologything", "other-biologything", "other-biologything", "other-biologything", "O", "other-biologything", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], ["O", "O", "O", "O", "O", "O", "O", "O", "product-ship", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], ["O", "O", "O", "O", "O", "O", "product-ship", "product-ship", "O", "product-ship", "product-ship", "O", "product-ship", "product-ship", "O"], ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "location-island", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"], ["O", "O", "O", "O", "O", "O", "product-ship", "product-ship", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]]},

  "types": ["other-biologything", "building-airport", "location-island", "product-ship", "product-software"]}

About the performance

Good Work!
We test the model (5-way 5~10-shot ProtoBERT inter) using the following script:

python train_demo.py  --mode inter \
--lr 1e-3 --batch_size 5 --trainN 5 --N 5 --K 5 --Q 5 \
--train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 \
--max_length 60 --model proto

and this is our result:
p: 0.4772, r: 0.5795, f1: 0.5228
It significantly outperforms the result reported in your paper (p: 0.3609, r: 0.4996, f1: 0.4186).

Could you please explain the result?
Thanks!

Code and dataset license

Can you add a LICENSE file to the repo clarifying the licenses for the code and the datasets used in the paper? This is essential for others to properly leverage what you've done.
Thanks!

Why does this dataset use the IO scheme?

Hi, I am just curious why this dataset uses IO scheme tagging instead of BIO etc.

Could the authors elaborate on why?

Thanks in advance.

How to create few-shot data?

Hi, thanks for your nice work.
After downloading the data, I find there is just one data file, supervised.
I find that few-shot files are needed in the script:
python3 train_demo.py --train data/mydata/train-inter.txt ...
Could you add some guidance to the README on creating few-shot data files like train-inter.txt?
Thanks!

Structshot

Hi.
As you wrote, the backbone BERT encoder used for the StructShot model is not pre-trained on the NER task. If I want to train the StructShot model in that way (e.g., directly using an NER dataset to fine-tune the model instead of N-way K-shot training), is there an easy way to do it with your code?

Training Setting Issue.

Why do the training settings in the README file not match the task requirements, for example, the "--Q 5" in 5-way-5~10-shot and "--N 5" in 10-way-5~10-shot?

How to create few-shot episode data for training and testing from general custom NER data?

How can we leverage the script to create episode data for training and testing from general custom NER data?

Though the module has the code, it is a bit complex to go through it and turn it into a utility for this.

It would be useful to have a simple utility for 2 things:

  1. Generate episode-train/test data from custom data in required format.
  2. Inference script

My data is in this format:


[
{'text': 
'TN: ***************\nYour item was delivered at the front door or porch at 09:18\nam on Jan 15, 1991 in **********.\n
ABCD Tracking ADT® Available\nStatus\n✔ Delivered, Front Door \nJan 25, 1991 at 10:24 am\******************\nGet Updates', 

'spans': [{'start': 17, 'end': 39, 'label': 'TN', 'ngram': '*******************'}, 
        {'start': 142, 'end': 161, 'label': 'Carrier', 'ngram': 'ABCD Tracking ADT®'}, 
        {'start': -1, 'end': 5, 'label': 'seller', 'ngram': 'nannan'}, 
        {'start': -1, 'end': 5, 'label': 'cust', 'ngram': 'nannan'}, 
        {'start': 210, 'end': 233, 'label': 'DOS', 'ngram': '09:18\nam on Jan 15, 1991'}
        ]}
]

Please let me know if anyone has written an inference script and code to generate episode train/test data.

@cyl628

The episode data link has expired

Hi, I have two questions:
1) The link to the pre-processed episode dataset has expired.
2) About the few-shot setting: from the code, each example samples N classes and K samples per class. After many iterations, every class in the train set will have been drawn. Is that still few-shot? My understanding is that few-shot means the train set contains only N classes with K samples each, N*K samples in total, and training is based on that train set.

Inference

Hello, thanks for this project. I was able to correctly train a StructShot model using the training script. Could you show how to correctly run inference for an input sequence?
In my understanding, the loading would look like

import os
import torch

from fewnerd.util.word_encoder import BERTWordEncoder
from fewnerd.model.proto import Proto
from fewnerd.model.nnshot import NNShot

# cache dir
cache_dir = os.getenv("cache_dir", "../../models")
model_path = 'structshot-inter-5-5-seed0.pth.tar'
model_name = 'structshot'
pretrain_ckpt = 'bert-base-uncased'
max_length = 100

# BERT word encoder
word_encoder = BERTWordEncoder(
        pretrain_ckpt,
        max_length)

if model_name == 'proto':
    # use dot instead of L2 distance for proto
    model = Proto(word_encoder, dot=True)
elif model_name == 'nnshot':
    model = NNShot(word_encoder, dot=False)
elif model_name == 'structshot':
    model = NNShot(word_encoder, dot=False)

model.load_state_dict(torch.load(os.path.join(cache_dir, model_path)))

but this way I get some errors on the state dict (RuntimeError: Error(s) in loading state_dict for NNShot:)...

Thank you in advance!

-1 in labels

Dear authors,

I am not sure what -1 represents in the labels. For instance, when I first looked at the example below, I thought -1 was given to tokens with "#".

Token1: ['continuing', 'north', ',', 'the', 'road', 'intersects', '67', '##th', 'street', 'and', 'enters', 'jackson', 'park', '.']
Labels1: [0, 0, 0, 0, 0, 0, 2, -1, 2, 0, 0, 0, 0, 0]

but when I look at another example

Token2: ['this', 'site', 'is', 'served', 'by', 'the', '', '', 'prem', '##et', '##ro', '', '', '(', ...]
Labels2: [0, 0, 0, 0, 0, 0, 0, -1, 2, -1, -1, 0, -1, 0, ...]

There seem to be other rules. Could you please explain when the value -1 is assigned to a label? Is -1 totally ignored during training and inference?

Using custom dataset

Hi, I attempted to use custom datasets by replacing the respective data (under data/intra/train | test | dev.txt). However, I am encountering an error, either:

ZeroDivisionError when using the proto model, or

Cannot perform max on tensor with no elements

Are there any other areas I should be amending in the data/code in order to use a custom dataset?

Thanks!

High CPU, no GPU usage -> sampler while-loop problem

Thanks for your release.

While I try to run the code, it loads the model to the GPU (approx 2 GB for XLM-R), then it runs mostly on the CPU (20 processes at 100%). Is this expected when it tries to decode using Viterbi?

Thanks!

Bug in few-shot sampling process

Hi!
It seems like the function

def __valid_sample__(self, sample, set_class, target_classes):

has a bug, which affects sampling process.

You can see a "for" loop and 4 "if" statements inside it. If the flag isvalid is set to True or False in some iteration, it can be overwritten in later iterations of the loop. So, for example, if the first class breaks the rule shots_count <= 2*K but the second class is new, the flag isvalid will be returned as True from __valid_sample__ even though the sample breaks the validation rules.

Thus, with this bug, we can draw samples that break the sampling rules, e.g. with shots_count > 2*K.

What do you think? Does this bug affect the results somehow? Maybe even in a good way.
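
For illustration, the early-exit check the reporter seems to be describing would look roughly like this (a hypothetical sketch, not the repo's actual function body):

def valid_sample(sample_classes, shot_counts, target_classes, K):
    # reject as soon as any class breaks a rule, so a later "good" class
    # cannot overwrite the flag set by an earlier "bad" one
    for cls in set(sample_classes):
        if cls not in target_classes:
            return False  # out-of-episode class
        if shot_counts[cls] >= 2 * K:
            return False  # class already at the 2K ceiling
    return True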

Raw dataset links not working

All the raw dataset links for the Supervised, Intra, and Inter datasets are broken on the website, as is the link (this one) in the README that prompts to download episodes.

I assume things moved around. Can you please update the links or provide alternate ways to download the raw data? I am mostly interested in the supervised version of the dataset where everything is available, since the inter and intra splits are available elsewhere through the website.

When executing `bash data/download.sh supervised`, the following error occurs.

--2021-10-07 11:03:02-- https://cloud.tsinghua.edu.cn/f/ae245e131e5a44609617/?dl=1
Resolving cloud.tsinghua.edu.cn (cloud.tsinghua.edu.cn)... 101.6.8.7
Connecting to cloud.tsinghua.edu.cn (cloud.tsinghua.edu.cn)|101.6.8.7|:443... connected.
ERROR: cannot verify cloud.tsinghua.edu.cn's certificate, issued by ‘/C=US/O=Let's Encrypt/CN=R3’:
Issued certificate has expired.
To connect to cloud.tsinghua.edu.cn insecurely, use `--no-check-certificate'.

A question about a detail of the few-shot experiments

In your paper, the whole entity type set is divided into three parts, used for training, validation, and testing, and entity labels are replaced to prevent information leakage. The paper describes it as follows: To avoid the observation of new entity types in the training phase, we replace the labels of entities that belong to Etest with O in the training set. Similarly, in the test set, entities that belong to Etrain and Edev are also replaced by O. According to this description, only entities belonging to the test set are replaced with O in the training set; entities belonging to the dev set do not seem to be replaced, and the handling of the dev set is not mentioned. I downloaded the released intra dataset and wrote a script to compute statistics, and found that entities belonging to the dev set are also replaced with O in the training set, and that the entity types of the three splits are mutually disjoint.
So I am now a bit confused: the description in the paper is inconsistent with the actual practice. Have I misunderstood the description in your paper?

502 Bad Gateway

Hello, thank you for creating this very interesting and important resource. I am trying to download the data, but getting a 502 Bad Gateway error from Nginx.

How to use a trained model

I trained and saved the models on a server. How can I use the trained models to analyze my own dataset?

Cannot ingest data into either train_demo or pre-processing

I finally was able to use Google Colab in conjunction with:

from datasets import load_dataset
dataset = load_dataset("dfki-nlp/few-nerd", 'supervised')

to get the supervised data (I am starting with supervised). It appears train_demo expects the two-column format on the README page, but the data file train.txt contains 4 columns: id, tokens, ner_tags, and fine_ner_tags. So that won't work.

So I tried pre-processing.py on the train.txt file, but that tries to load a JSON file with a column named srcTxt, so that won't work either.

So, in trying to reproduce the supervised results, how do I get the data from the format I have into something that train_demo will accept, like what is shown on the README page?

And what is it that I am downloading from datasets?

UnboundLocalError: local variable 'label' referenced before assignment

If I set the code to run with the CPU instead of the GPU, I get the above error.

This happens because, in the train function of FewShotNERFramework in framework.py, label is only defined if torch.cuda.is_available() is true. A few lines later, outside that if block, there is an assert statement that uses label, namely assert logits.shape[0] == label.shape[0], print(logits.shape, label.shape). Further lines also depend on label being present.

I get RuntimeError: CUDA error: out of memory if I try to run with the GPU.

In my opinion, the else case for CUDA not being available should be handled better than with this assert statement, even if running this code on the CPU is not supported.

The few-shot data and results

Hi, thanks for your nice work.

However, I just found that the few-shot results in different versions of the arXiv paper are different.
There are three groups of results: {v1, v2}, {v3, v4, v5}, and {v6}.

I notice some issues mentioned that:
(1) there are some imperfect implementations (such as hyperparameter selection and a tokenization bug).
(2) the sampled size could be larger than 2K due to a data sampling bug.

May I ask if (1) is the cause of the change in results from {v1, v2} to {v3, v4, v5}?
And is (2) (different sampled few-shot datasets) the cause of the change from {v3, v4, v5} to {v6}?
Or are there other reasons?

This is really nice work, and thanks again for your open source and hard work :).
Look forward to your reply.

Thanks.

OOM problem encountered with FewShotNERDatasetWithRandomSampling

When I try:

python train_demo.py --mode inter --lr 1e-4 --batch_size 8 --trainN 5 --N 5 --K 1 --Q 1 --train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 --max_length 64 --model structshot --tau 0.32

the program gets stuck on line 399 of framework.py, where it tries to get the next batch from self.train_data_loader. After a minute, it raises an OOM error.

I searched for a whole night and found that the __len__ function of FewShotNERDatasetWithRandomSampling returns 100000000000, which causes the OOM error.

When I change it to return len(self.samples), the problem disappears. Please fix it.

Tokenize the input word

Hi, thank you for sharing the data and code!

I just found that input words seem not to be correctly tokenized by the word tokenizer:

In the word_encoder.py file, each word is directly converted to a token id:

for raw_tokens in raw_tokens_list:
    indexed_tokens = self.tokenizer.convert_tokens_to_ids(raw_tokens)

However, a word could be tokenized into word pieces by:

for raw_tokens in raw_tokens_list:
    word_tokens = []
    for word in raw_tokens:
        word_tokens.extend(self.tokenizer.tokenize(word))
    indexed_tokens = self.tokenizer.convert_tokens_to_ids(word_tokens)

Directly converting words to token ids leads to lots of [UNK] tokens and makes the performance drop a lot.

How to do inference on my custom data after training on FewNERD data?

How do I use this Few-NERD model to do inference on my dataset after training on the Few-NERD data?
The idea is to see how it performs on my custom data.

Step 1: I train the proto model using the Few-NERD data as mentioned:

python3 train_demo.py --mode inter --lr 1e-4 --batch_size 8 --trainN 5 --N 5 --K 1 --Q 1 --train_iter 10000 --val_iter 500 --test_iter 5000 --val_step 1000 --max_length 64 --model proto --tau 0.32
Once training is complete:

I have a dataset in the format below, with entities ['dispute_amount', 'dispute_date', 'competitors'].

My test data is in this format:


{"
word": 
          [ 
              ["Dispute", "Case", "ID", "MM-Z-*********", "the", "amount", "of", "$99.99", "should", "be", "AU", "dollars", "total", "is", "$86.85", "US", "dollars."], 
               ["8:27am,", "I", "started", "a", "claim", "for", "which", "I", "was", "refunded", "for", "one", "item,", "but", "not", "for", "the", "other,", "from", "the", "same", "seller."] 
          ], 

"label": [ 

                 ["O", "O", "O", "O", "O", "O", "O", "dispute_amount", "O", "O", "O", "O", "O", "O", "dispute_amount", "dispute_amount", "dispute_amount"], 
                 ["dispute_date", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"] 
              ],  

"types": [["dispute_amount"], ["dispute_date"]]

}

How can I test and print outputs on my test data?

@cyl628 @ningding97

About the download bash script

Hi, I want to know why, when I run the download.sh script, the terminal gives me this error:

root@lwc:~/few# ./a.sh supervised
supervised
./a.sh: 4: [: supervised: unexpected operator
./a.sh: 10: [: supervised: unexpected operator
./a.sh: 16: [: supervised: unexpected operator
./a.sh: 22: [: supervised: unexpected operator

I want to know what's wrong with the script.

Thank you so much for your reply.

torch.stack(batch_query[k], 0) raises an error

Each sentence has a different length and a different number of sections, so the tensors in batch_query['word'] have different dimensions. In data_loader.py, torch.stack(batch_query['word'], 0) will raise an error.

What is the difference between episode-data model training and non-episode-data training?

Inside data/episode-data/inter/train_5_5.jsonl I see a lot of training, test, and dev data, organized as support and query sets.

In data/inter/train.txt I see a different form of training data, without support and query.

What is the purpose of keeping these 2 formats of data?

Are both used to train few-shot learning?

@cyl628

Some questions about the Few-NERD (intra) setting

Hi, I used the script for the inter setting to run the Few-NERD (intra) task, but the best result for 5-way 1~2-shot is just 18.47 F1 (35.92 is reported in the paper). I think it might have something to do with the choice of hyperparameters. Can you release the script for the intra task?

download link seems to have expired

When I try to download the raw data (supervised) using either the link provided on the Few-NERD page or the one in data/download.sh, I get a 404 response. Would you kindly provide an updated link to the datasets?

Not using the Span F1?

In the code (and also the paper)

def get_class_name(rawtag):

it seems that a tag sequence like "O, B-Person, I-Person, B-Person, I-Person, O" (which has two entities) will be collapsed into a sequence that contains only one entity, like "O, Person, Person, Person, Person, O". Is my understanding correct?
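
For reference, the collapsing being asked about amounts to stripping the B-/I- prefix from each tag, roughly as follows (an illustrative sketch, not the repo's exact code):

def get_class_name(rawtag):
    # "B-Person" / "I-Person" -> "Person"; "O" and unprefixed tags are unchanged
    if rawtag.startswith("B-") or rawtag.startswith("I-"):
        return rawtag[2:]
    return rawtag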

Difference between the StructShot model and the original

I have read the original StructShot paper: https://aclanthology.org/2020.emnlp-main.516/.
In that paper, StructShot trains one NER model, consisting of a token embedder (BERT) followed by a linear classifier, by supervised learning in the training phase, and evaluates the StructShot model using the fine-tuned token embedder.
However, in your paper and code, you instead train the StructShot model directly by meta-learning.
Could you tell me why? I think this doesn't reflect the performance of the original StructShot.

bug with tokenize

Output below; I am not sure what is wrong:

Traceback (most recent call last):
  File "/home/william/Few-NERD/train_demo.py", line 183, in <module>
    main()
  File "/home/william/Few-NERD/train_demo.py", line 149, in main
    framework = FewShotNERFramework(train_data_loader, val_data_loader, test_data_loader, N=opt.N, tau=opt.tau, train_fname=opt.train, viterbi=True, use_sampled_data=opt.use_sampled_data)
  File "/home/william/Few-NERD/util/framework.py", line 299, in __init__
    abstract_transitions = get_abstract_transitions(train_fname, use_sampled_data=use_sampled_data)
  File "/home/william/Few-NERD/util/framework.py", line 30, in get_abstract_transitions
    tag_lists = [sample.tags for sample in samples]
  File "/home/william/Few-NERD/util/framework.py", line 30, in <listcomp>
    tag_lists = [sample.tags for sample in samples]
  File "/home/william/Few-NERD/util/data_loader.py", line 203, in __getitem__
    support_set = self.__populate__(support_idx)
  File "/home/william/Few-NERD/util/data_loader.py", line 186, in __populate__
    tokens, labels = self.__get_token_label_list__(self.samples[idx])
  File "/home/william/Few-NERD/util/data_loader.py", line 114, in __get_token_label_list__
    word_tokens = self.tokenizer.tokenize(word)
AttributeError: 'NoneType' object has no attribute 'tokenize'
