
Keyphrase Dropout

Official code for the paper: KPDROP: Improving Absent Keyphrase Generation (EMNLP Findings 2022) https://arxiv.org/abs/2112.01476

Generally, we find that we can improve the absent-keyphrase performance of every baseline model we have tried so far (T5, One2Set, CatSeq, One2One, etc.) with KPDROP-A, or with KPDROP-R plus beam search, without harming present-keyphrase performance.

Additionally, we contribute to the semi-supervised setting for keyphrase generation: KPDrop allows better exploitation of synthetic data for enhanced absent-keyphrase performance.

Credits

Requirements

Check environment.yml (not all listed packages may be required; that is simply the environment I worked in).

Relevant Keyphrase Dropout Code:

If you just want to use Keyphrase Dropout in a different codebase you can refer to: https://github.com/JRC1995/KPDrop/blob/main/collaters/seq2seq_collater.py#L39

Its expected inputs, src and trg, are lists of tokens. trg should contain keyphrases delimited by ";" and end with "<eos>" (e.g., ["keyphrase1-first-word", "keyphrase1-second-word", ";", "keyphrase-2", "<eos>"]), but you can make minor modifications within the code to change these requirements. The main KPDropped outputs are new_src and new_trg, which are in the same format as src and trg. You can remove the construction of the other return variables and build whatever input and output format your task needs from new_src and new_trg.
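For illustration, here is a minimal, self-contained sketch of what a keyphrase-dropout step can look like on token lists in this format. This is a hypothetical re-implementation, not the repository's actual function: the function name, the exact span-matching strategy, and the seeding are assumptions.

```python
import random

def kpdrop(src, trg, p=0.7, seed=None):
    """Sketch of keyphrase dropout: with probability p, remove each present
    keyphrase's tokens from the source, turning it into an artificial absent
    keyphrase. src/trg are token lists; trg keyphrases are ";"-delimited
    and the sequence ends with "<eos>"."""
    rng = random.Random(seed)
    # Split trg into individual keyphrases (each a list of tokens).
    phrases, cur = [], []
    for tok in trg:
        if tok in (";", "<eos>"):
            if cur:
                phrases.append(cur)
            cur = []
        else:
            cur.append(tok)
    new_src = list(src)
    for phrase in phrases:
        if rng.random() < p:
            # Remove every occurrence of this phrase's token span from the source.
            i, kept = 0, []
            while i < len(new_src):
                if new_src[i:i + len(phrase)] == phrase:
                    i += len(phrase)
                else:
                    kept.append(new_src[i])
                    i += 1
            new_src = kept
    # The target is unchanged: a phrase dropped from the source becomes
    # an artificial absent keyphrase, which is the core idea of the paper.
    new_trg = list(trg)
    return new_src, new_trg
```

For example, with p=1.0, kpdrop(["the", "neural", "network", "model"], ["neural", "network", ";", "deep", "learning", "<eos>"]) removes "neural network" from the source (while "deep learning" was already absent) and leaves the target intact.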

Datasets

Preprocess

  • In the root project directory, in cmd: cd preprocess
  • Then run the following commands sequentially:
  • python process_kp20k.py
  • python process_kp20k2.py
  • python process_kp20k_split.py
  • python process_kp20k_split2.py
  • cd ..
  • bash get_embedding.sh (use Liang et al. 2021 extractor to generate synthetic labels)
  • bash ranker.sh (use Liang et al. 2021 extractor to generate synthetic labels)
  • cd preprocess
  • python process_kp20k_big_unsup.py

Alternatively, see if you can download the processed data from here (keep the processed_data folder in the root directory).

Supervised Experiments

  • (Set --device=[whatever GPU/CPU device you want] instead of cuda:0 if something else is needed.)
  • GRU One2Many (baseline): python train.py --model=GRUSeq2Seq --model_type=seq2seq --times=3 --dataset=kp20k2 --device=cuda:0
  • GRU One2Many (KPD-R): python train.py --model=GRUSeq2SeqKPD0_7 --model_type=seq2seq --times=3 --dataset=kp20k2 --device=cuda:0
  • GRU One2Many (KPD-A): python train.py --model=GRUSeq2SeqKPD0_7A --model_type=seq2seq --times=3 --dataset=kp20k2 --device=cuda:0
  • GRU One2One (baseline): python train.py --model=GRUSeq2One --model_type=seq2seq --times=3 --dataset=kp20k2 --device=cuda:0
  • GRU One2One (KPD-R): python train.py --model=GRUSeq2OneKPD0_7 --model_type=seq2seq --times=3 --dataset=kp20k2 --device=cuda:0
  • GRU One2One (KPD-A): python train.py --model=GRUSeq2OneKPD0_7A --model_type=seq2seq --times=3 --dataset=kp20k2 --device=cuda:0
  • Transformer One2Set (baseline): python train.py --model=TransformerSeq2Set --model_type=seq2set --times=3 --dataset=kp20k --device=cuda:0
  • Transformer One2Set (KPD-R): python train.py --model=TransformerSeq2SetKPD0_7 --model_type=seq2set --times=3 --dataset=kp20k --device=cuda:0
  • Transformer One2Set (KPD-A): python train.py --model=TransformerSeq2SetKPD0_7A --model_type=seq2set --times=3 --dataset=kp20k --device=cuda:0

Semi-Supervised Experiments

  • (Set --device=[whatever GPU/CPU device you want] instead of cuda:0 if something else is needed.)
  • GRU One2Many (PT): python train.py --model=GRUSeq2Seq --model_type=seq2seq --times=3 --dataset=kp20k_big_unsup --device=cuda:0
  • GRU One2Many (PT+KPD-R): python train.py --model=GRUSeq2SeqKPD0_7 --model_type=seq2seq --times=3 --dataset=kp20k_big_unsup --device=cuda:0
  • GRU One2Many (PT+KPD-A): python train.py --model=GRUSeq2SeqKPD0_7A --model_type=seq2seq --times=3 --dataset=kp20k_big_unsup --device=cuda:0
  • GRU One2Many (FT): python train.py --model=GRUSeq2SeqKPD0_7A --model_type=seq2seq --times=3 --dataset=kp20k_low_res --device=cuda:0
  • GRU One2Many (PT; FT): python train.py --model=GRUSeq2SeqKPD0_7Afrom_big --model_type=seq2seq --times=3 --dataset=kp20k_low_res --device=cuda:0
  • GRU One2Many (PT+KPD-R; FT) (only possible after PT+KPD-R): python train.py --model=GRUSeq2SeqKPD0_7Afrom_bigKPD0_7 --model_type=seq2seq --times=3 --dataset=kp20k_low_res --device=cuda:0
  • GRU One2Many (PT+KPD-A; FT) (only possible after PT+KPD-A): python train.py --model=GRUSeq2SeqKPD0_7Afrom_big0_7A --model_type=seq2seq --times=3 --dataset=kp20k_low_res --device=cuda:0

(Use the same chain of commands for the GRUSeq2One (GRU One2One) and TransformerSeq2Set (Transformer One2Set) experiments: replace the substring GRUSeq2Seq with GRUSeq2One or TransformerSeq2Set wherever applicable, and replace model_type=seq2seq with model_type=seq2set when using TransformerSeq2Set.)

Decoding

  • Add the following arguments to train.py for greedy-decoding-based testing: --decode_mode=Greedy --test
  • Add the following arguments to train.py for beam-decoding-based testing: --decode_mode=BeamLN --test

(All other arguments should be the same as those used for training the corresponding model.)

Evaluation

In the evaluation, @5R denotes evaluation in the style of Chan et al. and others, which effectively adds "dummy keyphrases" when fewer than k (here, 5) keyphrases are selected, so precision is always computed over k slots. Our @5R is equivalent to their @5, and to @5C in our paper (Appendix).
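As a concrete illustration of this padding convention, here is a hypothetical sketch (not the repository's evaluation code; the function name and exact matching are assumptions):

```python
def f1_at_k_with_dummies(predicted, gold, k=5):
    """F1@k in the style of Chan et al.: if fewer than k keyphrases are
    predicted, the list is conceptually padded with always-wrong "dummy"
    keyphrases, so precision is computed with a denominator of k."""
    top_k = predicted[:k]
    num_correct = len(set(top_k) & set(gold))
    precision = num_correct / k  # dummy padding: divide by k, not len(top_k)
    recall = num_correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With only two predictions, e.g. f1_at_k_with_dummies(["a", "b"], ["a", "c"], k=5), precision is 1/5 rather than 1/2, which is the difference this convention introduces.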

Citation

@inproceedings{ray-chowdhury-etal-2022-kpdrop,
    title = "{KPDROP}: Improving Absent Keyphrase Generation",
    author = "Ray Chowdhury, Jishnu  and
      Park, Seo Yeon  and
      Kundu, Tuhin  and
      Caragea, Cornelia",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.357",
    pages = "4853--4870",
    abstract = "Keyphrase generation is the task of generating phrases (keyphrases) that summarize the main topics of a given document. Keyphrases can be either present or absent from the given document. While the extraction of present keyphrases has received much attention in the past, only recently a stronger focus has been placed on the generation of absent keyphrases. However, generating absent keyphrases is challenging; even the best methods show only a modest degree of success. In this paper, we propose a model-agnostic approach called keyphrase dropout (or KPDrop) to improve absent keyphrase generation. In this approach, we randomly drop present keyphrases from the document and turn them into artificial absent keyphrases during training. We test our approach extensively and show that it consistently improves the absent performance of strong baselines in both supervised and resource-constrained semi-supervised settings.",
}

Contact: [email protected] for any issues or questions.

