The amicorpusxml from gcunhase

About

Extracts meeting transcripts and summaries (extractive and abstractive) from the AMI Meeting Corpus
Transforms them into the CNN-DailyMail News dataset format (.story files containing an article and its highlights)

Requirements • About AMI Meeting Corpus • AMI DialSum Corpus • How to Use • How to Cite

Requirements

Tested on Python 3.6+, Ubuntu 16.04, Mac OS

pip install nltk

AMI Meeting Corpus

More info
Number of meetings (including scenario and non-scenario): 171
- Number of speakers per meeting: 4-5
- Total number of transcripts: 687
Number of summaries: 142 (abstractive) and 137 (extractive)
- Only available for meetings with names starting with ES, IS and TS

How to Use

Download AMI Corpus and extract .story files

python main_obtain_meeting2summary_data.py --summary_type abstractive

A ready-made .story dataset is provided under data/ami-transcripts-stories/

Configuration options

Argument	Type	Default
`summary_type`	string	`"abstractive"`
`ami_xml_dir`	string	`"data/"`
`results_transcripts_speaker_dir`	string	`"data/ami-transcripts-speaker/"`
`results_transcripts_dir`	string	`"data/ami-transcript/"`
`results_summary_dir`	string	`"data/ami-summary/"`

summary_type is the type of summary to be extracted. Options=["abstractive", "extractive"].
ami_xml_dir is the directory where the AMI Corpus will be downloaded
results_transcripts_speaker_dir is the directory where each speaker's transcript will be saved
results_transcripts_dir is the directory where each meeting's transcript will be saved
results_summary_dir is the directory where each meeting's summary will be saved
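
As a rough illustration of how these options fit together, here is a minimal argparse sketch. It assumes the script uses argparse and that each argument is passed as a `--name` flag; the argument names and defaults are copied from the table above, while `build_parser` is a hypothetical helper and not necessarily how main_obtain_meeting2summary_data.py is written.

```python
import argparse


def build_parser():
    # Hypothetical parser: names and defaults mirror the table above;
    # the real script may define them differently.
    parser = argparse.ArgumentParser(
        description="Extract AMI transcripts and summaries")
    parser.add_argument("--summary_type", type=str, default="abstractive",
                        choices=["abstractive", "extractive"])
    parser.add_argument("--ami_xml_dir", type=str, default="data/")
    parser.add_argument("--results_transcripts_speaker_dir", type=str,
                        default="data/ami-transcripts-speaker/")
    parser.add_argument("--results_transcripts_dir", type=str,
                        default="data/ami-transcript/")
    parser.add_argument("--results_summary_dir", type=str,
                        default="data/ami-summary/")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--summary_type", "extractive"])
print(args.summary_type, args.results_summary_dir)
```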

Code explanation

Obtain transcripts
- Transcripts are originally saved in data/ami_public_manual_1.6.2/words/*.xml
- Example: EN2001a.A.words.xml
  - Meeting name: EN2001
  - Meeting part: a (meetings are divided into 1-hour parts, each labeled with a consecutive lowercase letter)
  - Speaker: A (usually there are four speakers named A, B, C and D, but E is sometimes also present)
- How:
  - Each .xml file has a number of tags with the words and their respective times in the audio/video file.
  - To extract the transcripts, we have to put those words back together into sentences and paragraphs.
  - Thus XML parsing is required.
- Output: 2 folders with corresponding .txt files
  - data/ami-transcripts-speaker/: meeting transcripts for each speaker
  - data/ami-transcripts/: complete meeting transcripts (all speakers together)
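
The file naming and word-joining steps above can be sketched as follows. This is a simplified illustration, not the repository's code: the sample XML is a made-up, reduced shape (it assumes one `w` element per token and a `punc="true"` attribute on punctuation), and `parse_words_filename` and `words_to_text` are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

# Assumed, simplified shape of a words file. Real files carry namespaced
# attributes and additional tags that this sketch ignores.
SAMPLE_WORDS_XML = """<root>
  <w id="w0" starttime="1.0" endtime="1.2">Okay</w>
  <w id="w1" starttime="1.2" endtime="1.2" punc="true">,</w>
  <w id="w2" starttime="1.3" endtime="1.6">good</w>
  <w id="w3" starttime="1.6" endtime="2.0">morning</w>
  <w id="w4" starttime="2.0" endtime="2.0" punc="true">.</w>
</root>"""


def parse_words_filename(filename):
    """Split e.g. 'EN2001a.A.words.xml' into (meeting, part, speaker)."""
    match = re.fullmatch(r"([A-Z]{2}\d+)([a-z])\.([A-E])\.words\.xml", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return match.groups()


def words_to_text(xml_string):
    """Join <w> tokens into one string, attaching punctuation to the previous word."""
    root = ET.fromstring(xml_string)
    pieces = []
    for w in root.iter("w"):
        token = (w.text or "").strip()
        if not token:
            continue
        if w.get("punc") == "true" and pieces:
            pieces[-1] += token  # no space before punctuation
        else:
            pieces.append(token)
    return " ".join(pieces)


print(parse_words_filename("EN2001a.A.words.xml"))
print(words_to_text(SAMPLE_WORDS_XML))
```

Merging the per-speaker files into one complete meeting transcript would additionally need the start times to order words across speakers, which this sketch leaves out.
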
Obtain abstractive summaries
- Located in data/ami_public_manual_1.6.2/abstractive/*.xml
- How:
  - Extract the text inside the abstract tag
  - The text inside the abstract tag is composed of text in sentence tags
  - Return the text of all these sentence tags as one paragraph
- Output: data/ami-summary/abstractive/
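
A minimal sketch of the abstract extraction described above, using ElementTree. The sample XML and its sentences are invented for illustration; only the `abstract` and `sentence` tag names come from the description, and the sibling `notes` tag is hypothetical, included only to show that text outside `abstract` is ignored.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed shape: <sentence> tags nested in <abstract>.
SAMPLE_ABSTRACT_XML = """<root>
  <abstract>
    <sentence>The project manager opened the meeting.</sentence>
    <sentence>The team discussed the
      remote control design.</sentence>
  </abstract>
  <notes>
    <sentence>This sentence is outside the abstract.</sentence>
  </notes>
</root>"""


def abstract_paragraph(xml_string):
    """Return the sentences inside <abstract> joined into one paragraph."""
    root = ET.fromstring(xml_string)
    abstract = root.find("abstract")
    if abstract is None:
        return ""
    # Collapse internal whitespace so multi-line sentences become single lines.
    sentences = [" ".join((s.text or "").split())
                 for s in abstract.iter("sentence")]
    return " ".join(s for s in sentences if s)


print(abstract_paragraph(SAMPLE_ABSTRACT_XML))
```
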
Obtain extractive summaries
- Located in data/ami_public_manual_1.6.2/extractive/*.xml
- How:
  - Extract the child nodes inside the extsumm tag
  - The extsumm tag is composed of child nodes such as the examples below:
    - Example 1: <nite:child href="ES2002a.B.dialog-act.xml#id(ES2002a.B.dialog-act.dharshi.3)"/>
      - This node refers to a node of ID ES2002a.B.dialog-act.dharshi.3 in a file named ES2002a.B.dialog-act.xml in data/dialogueActs/
      - The node with this ID is:
        <dact nite:id="ES2002a.B.dialog-act.dharshi.3" reflexivity="true"> <nite:pointer role="da-aspect" href="da-types.xml#id(ami_da_4)"/> <nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words4)..id(ES2002a.B.words16)"/> </dact>
      - This indicates words 4 to 16 in ES2002a.B.words.xml in data/words/
    - Example 2: <nite:child href="ES2002a.D.dialog-act.xml#id(ES2002a.D.dialog-act.dharshi.16)..id(ES2002a.D.dialog-act.dharshi.20)"/>
      - This node refers to the nodes with IDs ES2002a.D.dialog-act.dharshi.16 through ES2002a.D.dialog-act.dharshi.20 in a file named ES2002a.D.dialog-act.xml in data/dialogueActs/
  - Return all the collected words as a paragraph
- Output: data/ami-summary/extractive/
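
The href values in the examples above can be split into a file name and an ID range with a small parser like the one below. This is a sketch, not the repository's implementation: `parse_href`, `split_id` and `expand_ids` are hypothetical helpers, and `expand_ids` assumes that all IDs in a range share a prefix and end in consecutive integers. A more robust version would walk the target file from the start ID to the end ID instead.

```python
import re

# file.xml#id(START) optionally followed by ..id(END)
HREF_RE = re.compile(
    r"(?P<file>[^#]+)#id\((?P<start>[^)]+)\)(?:\.\.id\((?P<end>[^)]+)\))?"
)


def parse_href(href):
    """Return (file_name, start_id, end_id); end_id equals start_id for one node."""
    m = HREF_RE.fullmatch(href)
    if m is None:
        raise ValueError(f"unexpected href: {href}")
    start = m.group("start")
    return m.group("file"), start, m.group("end") or start


def split_id(node_id):
    """Split 'ES2002a.B.words16' into ('ES2002a.B.words', 16)."""
    m = re.fullmatch(r"(.*?)(\d+)", node_id)
    if m is None:
        raise ValueError(f"ID has no numeric suffix: {node_id}")
    return m.group(1), int(m.group(2))


def expand_ids(href):
    """List every ID in the range (assumes consecutive integer suffixes)."""
    file_name, start, end = parse_href(href)
    prefix, first = split_id(start)
    end_prefix, last = split_id(end)
    if prefix != end_prefix:
        raise ValueError("start and end IDs do not share a prefix")
    return file_name, [f"{prefix}{i}" for i in range(first, last + 1)]


words_href = "ES2002a.B.words.xml#id(ES2002a.B.words4)..id(ES2002a.B.words16)"
file_name, ids = expand_ids(words_href)
print(file_name, ids[0], ids[-1], len(ids))
```

Resolving an extractive summary then means applying this twice: once to go from the extsumm child to dialogue-act nodes, and once more to go from each dialogue act to its words.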

Extra: AMI DialSum Meeting Corpus

DialSum: modified version of the AMI Meeting Dataset
Use script ami_dialsum_meeting_story.py:
- This script takes two text files (in and sum) and formats them into a series of .story files compatible with the CNN/DM format
- Each line in file in is a meeting transcript, and its summary is on the same line of file sum.
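
The line-by-line pairing described above can be sketched as follows. It is an illustration under assumptions, not the code of ami_dialsum_meeting_story.py: `format_story` and `write_stories` are hypothetical helpers, the output file naming (`meeting_0000.story`, ...) is invented, and the layout follows the usual CNN/DM convention of the article text followed by an `@highlight` marker and the summary.

```python
import os
import tempfile


def format_story(article, summary):
    """CNN/DM-style story text: the article, then the summary after '@highlight'."""
    return f"{article.strip()}\n\n@highlight\n\n{summary.strip()}\n"


def write_stories(in_path, sum_path, out_dir):
    """Pair line i of in_path with line i of sum_path and write one .story each."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(in_path, encoding="utf-8") as f_in, \
            open(sum_path, encoding="utf-8") as f_sum:
        for i, (article, summary) in enumerate(zip(f_in, f_sum)):
            # Hypothetical naming scheme; the real script may name files differently.
            path = os.path.join(out_dir, f"meeting_{i:04d}.story")
            with open(path, "w", encoding="utf-8") as f_out:
                f_out.write(format_story(article, summary))
            written.append(path)
    return written


# Small demo with two made-up transcript/summary pairs in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "in")
    sum_path = os.path.join(tmp, "sum")
    with open(in_path, "w", encoding="utf-8") as f:
        f.write("first meeting transcript\nsecond meeting transcript\n")
    with open(sum_path, "w", encoding="utf-8") as f:
        f.write("first summary\nsecond summary\n")
    paths = write_stories(in_path, sum_path, os.path.join(tmp, "stories"))
    print(len(paths))
```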

Notes

XML reader in Python:
- Minidom vs Element Tree: Reading XML files in Python
- Minidom: XML parser for Python
TODO
- Overlapping meeting transcript
- Decision abstract

Acknowledgement

Please star or fork if this code was useful for you. If you use it in a paper, please cite as:

@software{cunha_sergio2019ami_xml2story,
    author       = {Gwenaelle Cunha Sergio},
    title        = {{gcunhase/AMICorpusXML: Obtaining Transcript and Abstractive and Extractive Summaries from the AMI Meeting Corpus and formatting the AMI DialSum Meeting Corpus}},
    month        = dec,
    year         = 2019,
    doi          = {10.5281/zenodo.3561298},
    version      = {v2.1},
    publisher    = {Zenodo},
    url          = {https://github.com/gcunhase/AMICorpusXML}
    }

If you use the AMI Meeting Corpus, please also add the following citation:

@INPROCEEDINGS{Mccowan05theami,
    author = {I. Mccowan and G. Lathoud and M. Lincoln and A. Lisowska and W. Post and D. Reidsma and P. Wellner},
    title = {The AMI Meeting Corpus},
    booktitle = {In: Proceedings Measuring Behavior 2005, 5th International Conference on Methods and Techniques in Behavioral Research. L.P.J.J. Noldus, F. Grieco, L.W.S. Loijens and P.H. Zimmerman (Eds.), Wageningen: Noldus Information Technology},
    year = {2005}
}
