
abtextsum

This source code has been used in the experimental procedure of the following paper:

Panagiotis Kouris, Georgios Alexandridis, Andreas Stafylopatis. 2019. Abstractive text summarization based on deep learning and semantic content generalization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5082-5092.

This paper is available in the Proceedings of the 57th Annual Meeting of the ACL (2019) or directly at https://www.aclweb.org/anthology/P19-1501.
For citation, the following BibTeX entry can be used:

@inproceedings{kouris2019abstractive,
  title = {Abstractive text summarization based on deep learning and semantic content generalization},
  author = {Kouris, Panagiotis and Alexandridis, Georgios and Stafylopatis, Andreas},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  month = jul,
  year = {2019},
  address = {Florence, Italy},
  publisher = {Association for Computational Linguistics},
  url = {https://www.aclweb.org/anthology/P19-1501},
  pages = {5082--5092},
}



Code Description

The code described below follows the methodology and the assumptions that are described in detail in the aforementioned paper. The experimental procedure, as described in the paper, requires as the initial dataset for training, validation and testing the Gigaword dataset, as described by Rush et al. (2015) (see the references in the paper). For testing, the DUC 2004 dataset is also used, as described in the paper.
According to the paper, the initial dataset is further preprocessed and generalized according to one of the proposed text generalization strategies (e.g. NEG100 or LG200d5); for instance, under the NEG strategy a named entity may be replaced by a generic named-entity tag, while under the LG strategy a word may be replaced by one of its WordNet hypernyms. The generalized dataset is then used for training, during which the deep learning model learns to predict a generalized summary.
In the testing phase, a generalized article (e.g. an article of the test set) is given as input to the deep learning model, which predicts the respective generalized summary. Then, in the post-processing phase, the generalized concepts of the generalized summary are replaced by the specific concepts of the original (preprocessed) article, producing the final summary.

The workflow of this framework is as follows:

  1. Preprocessing of the dataset
    The task of preprocessing the dataset is performed by the DataPreprocessing class (file preprocessing.py). The method clean_dataset() is used for preprocessing the Gigaword dataset, while the method clean_duc_dataset_from_original_to_cleaned() is used for the DUC dataset.
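    A minimal call sketch of this step follows; the argument-free constructor is an assumption, and the actual file paths are configured as described in "Setting parameters and paths" below:

      # illustrative only -- class and method names are taken from preprocessing.py
      from preprocessing import DataPreprocessing

      dp = DataPreprocessing()                            # paths are assumed to be taken from paths.py
      dp.clean_dataset()                                  # cleans the Gigaword dataset
      # dp.clean_duc_dataset_from_original_to_cleaned()   # the corresponding step for the DUC dataset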

  2. Text generalization
    Both text generalization tasks, NEG and LG, are performed by the DataPreprocessing class (file preprocessing.py).
    Firstly, part-of-speech tagging is required; it is performed by the pos_tagging_of_dataset_and_vocabulary_of_words_pos_frequent() method for the Gigaword dataset and by the pos_tagging_of_duc_dataset_and_vocab_pos_frequent() method for the DUC dataset. Then the NEG or LG strategy can be applied as follows (a combined call sketch is given after the two strategies):

    1. NEG Strategy
      The annotation of named entities is performed by the ner_of_dataset_and_vocabulary_of_ner_words() method for the Gigaword dataset and by the ner_of_duc_dataset_and_vocab_of_ne() method for the DUC dataset. Then the methods conver_dataset_with_ner_from_stanford_and_wordnet() (for the Gigaword dataset) and conver_duc_dataset_with_ner_from_stanford_and_wordnet() (for the DUC dataset) generalize these datasets according to the NEG strategy, with the parameters set accordingly.

    2. LG Strategy
      The word_freq_hypernym_paths() method produces a file that contains a vocabulary with the frequency and the hypernym path of each word. This file is then used by the vocab_based_on_hypernyms() method in order to produce a file that contains a vocabulary with the words that are candidates for generalization. Finally, for the Gigaword dataset, the convert_dataset_to_general() method produces the files with the summary-article pairs which constitute the generalized dataset, while for the DUC dataset the convert_duc_dataset_based_on_level_of_generalizetion() method is used. The hyperparameters of these methods should be set accordingly.
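    A combined call sketch for both strategies on the Gigaword dataset follows; the argument-free calls are an assumption, and in practice the parameters and file paths of these methods have to be set as described above:

      # illustrative only -- method names are taken from preprocessing.py
      from preprocessing import DataPreprocessing

      dp = DataPreprocessing()

      # common prerequisite: part-of-speech tagging
      dp.pos_tagging_of_dataset_and_vocabulary_of_words_pos_frequent()

      # NEG strategy: named-entity annotation, then generalization
      dp.ner_of_dataset_and_vocabulary_of_ner_words()
      dp.conver_dataset_with_ner_from_stanford_and_wordnet()

      # LG strategy: hypernym paths, candidate vocabulary, then generalization
      dp.word_freq_hypernym_paths()
      dp.vocab_based_on_hypernyms()
      dp.convert_dataset_to_general()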

  3. Building dataset for training, validation and testing
    The BuildDataset class (file build_dataset.py) creates the files that are given as input to the deep learning model for training, validation or testing.
    To build the dataset, the appropriate file paths should be set in the __init__() of the BuildDataset class and the following commands should be executed, where the argument -model specifies the employed generalization strategy (e.g. lg100d5, neg100 etc.):

    1. Building the training dataset: python build_dataset.py -mode train -model lg100d5g
    2. Building the validation dataset: python build_dataset.py -mode validation -model lg100d5g
    3. Building the testing dataset: python build_dataset.py -mode test -model lg100d5g

  4. Training
    The process of training is performed by the Train class (file train_v2.py), with the hyperparameters set accordingly. The files produced in the previous step (building the dataset) are used as input to this training phase. Training is run with the command: python train.py -model neg100, where the argument -model specifies the employed generalization strategy (e.g. lg100d5, neg100 etc.).

  5. Post-processing of generalized summaries
    In the testing phase, post-processing of the generalized summaries produced by the deep learning model is required in order to replace the generalized concepts of each generalized summary with the specific concepts of the corresponding original article. This task is performed by the PostProcessing class, by setting the parameters of its __init__() method accordingly. More specifically, the mode should be set to "lg" or "neg" according to the employed text generalization strategy. Also, the parameters of the neg_postprocessing() and lg_postprocessing() methods for the file paths, the text-similarity function and the context window should be set accordingly. A sketch of the replacement idea follows.
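    The following is a minimal sketch of the underlying replacement idea only, not the actual PostProcessing implementation: the position-to-tag mapping and the word-overlap similarity used here are simplifying assumptions (the repository's similarity function and context window are configurable as described above).

      # illustrative only -- not the repository's PostProcessing class
      def _overlap(a, b):
          """Word-overlap similarity between two context windows (a stand-in similarity function)."""
          a, b = set(a), set(b)
          return len(a & b) / max(len(a | b), 1)

      def fill_generalized_summary(gen_summary, article, article_tags, window=3):
          """Replace each generalized token of the predicted summary with the article word
          whose surrounding context is most similar. article_tags maps article word positions
          to their generalized tag (an assumed representation of the annotation output)."""
          summary_words = gen_summary.split()
          article_words = article.split()
          tags = set(article_tags.values())
          out = []
          for i, tok in enumerate(summary_words):
              if tok not in tags:
                  out.append(tok)                       # ordinary word: keep it as is
                  continue
              s_ctx = summary_words[max(0, i - window): i + window + 1]
              best, best_score = tok, -1.0
              for j, tag in article_tags.items():
                  if tag != tok:
                      continue                          # only candidates carrying the same tag
                  a_ctx = article_words[max(0, j - window): j + window + 1]
                  score = _overlap(a_ctx, s_ctx)
                  if score > best_score:
                      best, best_score = article_words[j], score
              out.append(best)                          # most similar specific concept
          return " ".join(out)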

  6. Testing
    The Testing class (file testing.py) performs the testing of this framework. For the Gigaword dataset, a subset of its test set (e.g. 4000 instances) should be used to evaluate the framework, while for the DUC dataset the whole set of instances is used. The Testing class requires the official ROUGE package for measuring the performance of the proposed framework.
    In order to perform testing, the appropriate file paths should be set in the __init__() of the Testing class, and one of the following modes should be run:

    1. Testing for gigaword: python testing.py -mode gigaword
    2. Testing for duc: python testing.py -mode duc
    3. Testing for duc capped to 75 bytes: python testing.py -mode duc75b

Setting parameters and paths
The values of hyperparameters should be specified in the file parameters.py, while the paths of the corresponding files should be set in the file paths.py.
Additionally, a file with word embeddings (e.g. word2vec) is required; its file path and the dimension of the vectors (e.g. 300) should be specified in paths.py and parameters.py, respectively.
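A hedged illustration of the kind of entries involved follows; the variable names and the file location are assumptions made for the example, not the actual names used in paths.py and parameters.py:

    # paths.py (illustrative entry)
    word_embeddings_path = "data/word2vec_300d.txt"   # location of the pre-trained embedding file

    # parameters.py (illustrative entry)
    embedding_dim = 300                               # dimension of the word vectors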

The project was developed in Python 3.5 and the required Python packages are listed in the file requirements.txt.

The code described above includes the functionality that was used in the experimental procedure of the corresponding paper. However, the proposed framework is not limited to the current implementation: it is based on a well-defined theoretical model, so its performance may be enhanced by extending or improving this implementation (e.g. using a better taxonomy of concepts, a different machine learning model, or an alternative similarity method for the post-processing task).
