
abtextsum

This source code has been used in the experimental procedure of the following paper:

Panagiotis Kouris, Georgios Alexandridis, Andreas Stafylopatis. 2019. Abstractive text summarization based on deep learning and semantic content generalization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5082-5092.

This paper is available in the Proceedings of the 57th Annual Meeting of the ACL (2019) or directly at https://www.aclweb.org/anthology/P19-1501.
For citation, the following BibTeX entry can be used:

@inproceedings{kouris2019abstractive,
  title = {Abstractive text summarization based on deep learning and semantic content generalization},
  author = {Kouris, Panagiotis and Alexandridis, Georgios and Stafylopatis, Andreas},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  month = jul,
  year = {2019},
  address = {Florence, Italy},
  publisher = {Association for Computational Linguistics},
  url = {https://www.aclweb.org/anthology/P19-1501},
  pages = {5082--5092},
}



Code Description

The code described below follows the methodology and the assumptions that are described in detail in the aforementioned paper. The experimental procedure, as described in the paper, requires as the initial dataset for training, validation and testing the Gigaword dataset, as described by Rush et al. (2015) (see the references in the paper). For testing, the DUC 2004 dataset is also used, as described in the paper.
According to the paper, the initial dataset is further preprocessed and generalized according to one of the proposed text generalization strategies (e.g. NEG100 or LG200d5); for instance, under the NEG strategy a named entity may be replaced by a generic named-entity tag, while under the LG strategy a word may be replaced by one of its WordNet hypernyms. The generalized dataset is then used for training, during which the deep learning model learns to predict a generalized summary.
In the testing phase, a generalized article (e.g. an article of the test set) is given as input to the deep learning model, which predicts the respective generalized summary. Then, in the post-processing phase, the generalized concepts of the generalized summary are replaced by the specific concepts of the original (preprocessed) article, producing the final summary.

The workflow of this framework is as follows:

  1. Preprocessing of the dataset
    The task of preprocessing the dataset is performed by the DataPreprocessing class (file preprocessing.py). The method clean_dataset() is used for preprocessing the Gigaword dataset, while the method clean_duc_dataset_from_original_to_cleaned() is used for the DUC dataset.
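    A minimal call sketch of this step follows; the argument-free constructor is an assumption, and the actual file paths are configured as described in "Setting parameters and paths" below:

      # illustrative only -- class and method names are taken from preprocessing.py
      from preprocessing import DataPreprocessing

      dp = DataPreprocessing()                            # paths are assumed to be taken from paths.py
      dp.clean_dataset()                                  # cleans the Gigaword dataset
      # dp.clean_duc_dataset_from_original_to_cleaned()   # the corresponding step for the DUC dataset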

  2. Text generalization
    Both text generalization tasks, NEG and LG, are performed by the DataPreprocessing class (file preprocessing.py).
    Firstly, part-of-speech tagging is required; it is performed by the pos_tagging_of_dataset_and_vocabulary_of_words_pos_frequent() method for the Gigaword dataset and by the pos_tagging_of_duc_dataset_and_vocab_pos_frequent() method for the DUC dataset. Then the NEG or LG strategy can be applied as follows (a combined call sketch is given after the two strategies):

    1. NEG Strategy
      The annotation of named entities is performed by the ner_of_dataset_and_vocabulary_of_ner_words() method for the Gigaword dataset and by the ner_of_duc_dataset_and_vocab_of_ne() method for the DUC dataset. Then the methods conver_dataset_with_ner_from_stanford_and_wordnet() (for the Gigaword dataset) and conver_duc_dataset_with_ner_from_stanford_and_wordnet() (for the DUC dataset) generalize these datasets according to the NEG strategy, with the parameters set accordingly.

    2. LG Strategy
      The word_freq_hypernym_paths() method produces a file that contains a vocabulary with the frequency and the hypernym path of each word. This file is then used by the vocab_based_on_hypernyms() method in order to produce a file that contains a vocabulary with the words that are candidates for generalization. Finally, for the Gigaword dataset, the convert_dataset_to_general() method produces the files with the summary-article pairs which constitute the generalized dataset, while for the DUC dataset the convert_duc_dataset_based_on_level_of_generalizetion() method is used. The hyperparameters of these methods should be set accordingly.
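    A combined call sketch for both strategies on the Gigaword dataset follows; the argument-free calls are an assumption, and in practice the parameters and file paths of these methods have to be set as described above:

      # illustrative only -- method names are taken from preprocessing.py
      from preprocessing import DataPreprocessing

      dp = DataPreprocessing()

      # common prerequisite: part-of-speech tagging
      dp.pos_tagging_of_dataset_and_vocabulary_of_words_pos_frequent()

      # NEG strategy: named-entity annotation, then generalization
      dp.ner_of_dataset_and_vocabulary_of_ner_words()
      dp.conver_dataset_with_ner_from_stanford_and_wordnet()

      # LG strategy: hypernym paths, candidate vocabulary, then generalization
      dp.word_freq_hypernym_paths()
      dp.vocab_based_on_hypernyms()
      dp.convert_dataset_to_general()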

  3. Building dataset for training, validation and testing
    The BuildDataset class (file build_dataset.py) creates the files that are given as input to the deep learning model for training, validation or testing.
    To build the dataset, the appropriate file paths should be set in the __init__() of the BuildDataset class and the following commands should be executed, where the argument -model specifies the employed generalization strategy (e.g. lg100d5, neg100 etc.):

    1. Building the training dataset: python build_dataset.py -mode train -model lg100d5g
    2. Building the validation dataset: python build_dataset.py -mode validation -model lg100d5g
    3. Building the testing dataset: python build_dataset.py -mode test -model lg100d5g

  4. Training
    The process of training is performed by the Train class (file train_v2.py), with the hyperparameters set accordingly. The files produced in the previous step (building the dataset) are used as input to this training phase. Training is run with the command: python train.py -model neg100, where the argument -model specifies the employed generalization strategy (e.g. lg100d5, neg100 etc.).

  5. Post-processing of generalized summaries
    In the testing phase, post-processing of the generalized summaries produced by the deep learning model is required in order to replace the generalized concepts of each generalized summary with the specific concepts of the corresponding original article. This task is performed by the PostProcessing class, by setting the parameters of its __init__() method accordingly. More specifically, the mode should be set to "lg" or "neg" according to the employed text generalization strategy. Also, the parameters of the neg_postprocessing() and lg_postprocessing() methods for the file paths, the text-similarity function and the context window should be set accordingly. A sketch of the replacement idea follows.
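    The following is a minimal sketch of the underlying replacement idea only, not the actual PostProcessing implementation: the position-to-tag mapping and the word-overlap similarity used here are simplifying assumptions (the repository's similarity function and context window are configurable as described above).

      # illustrative only -- not the repository's PostProcessing class
      def _overlap(a, b):
          """Word-overlap similarity between two context windows (a stand-in similarity function)."""
          a, b = set(a), set(b)
          return len(a & b) / max(len(a | b), 1)

      def fill_generalized_summary(gen_summary, article, article_tags, window=3):
          """Replace each generalized token of the predicted summary with the article word
          whose surrounding context is most similar. article_tags maps article word positions
          to their generalized tag (an assumed representation of the annotation output)."""
          summary_words = gen_summary.split()
          article_words = article.split()
          tags = set(article_tags.values())
          out = []
          for i, tok in enumerate(summary_words):
              if tok not in tags:
                  out.append(tok)                       # ordinary word: keep it as is
                  continue
              s_ctx = summary_words[max(0, i - window): i + window + 1]
              best, best_score = tok, -1.0
              for j, tag in article_tags.items():
                  if tag != tok:
                      continue                          # only candidates carrying the same tag
                  a_ctx = article_words[max(0, j - window): j + window + 1]
                  score = _overlap(a_ctx, s_ctx)
                  if score > best_score:
                      best, best_score = article_words[j], score
              out.append(best)                          # most similar specific concept
          return " ".join(out)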

  6. Testing
    The Testing class (file testing.py) performs the testing of this framework. For the Gigaword dataset, a subset of its test set (e.g. 4000 instances) should be used to evaluate the framework, while for the DUC dataset the whole set of instances is used. The Testing class requires the official ROUGE package for measuring the performance of the proposed framework.
    In order to perform testing, the appropriate file paths should be set in the __init__() of the Testing class, and one of the following modes should be run:

    1. Testing for gigaword: python testing.py -mode gigaword
    2. Testing for duc: python testing.py -mode duc
    3. Testing for duc capped to 75 bytes: python testing.py -mode duc75b

Setting parameters and paths
The values of hyperparameters should be specified in the file parameters.py, while the paths of the corresponding files should be set in the file paths.py.
Additionally, a file with word embeddings (e.g. word2vec) is required; its file path and the dimension of the vectors (e.g. 300) should be specified in paths.py and parameters.py, respectively.
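A hedged illustration of the kind of entries involved follows; the variable names and the file location are assumptions made for the example, not the actual names used in paths.py and parameters.py:

    # paths.py (illustrative entry)
    word_embeddings_path = "data/word2vec_300d.txt"   # location of the pre-trained embedding file

    # parameters.py (illustrative entry)
    embedding_dim = 300                               # dimension of the word vectors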

The project was developed in Python 3.5 and the required Python packages are listed in the file requirements.txt.

The code described above includes the functionality that was used in the experimental procedure of the corresponding paper. However, the proposed framework is not limited to the current implementation: it is based on a well-defined theoretical model, so its performance may be enhanced by extending or improving this implementation (e.g. using a better taxonomy of concepts, a different machine learning model, or an alternative similarity method for the post-processing task).
