amicorpusxml's Introduction

About

  • Extracts meeting transcripts and summaries (extractive and abstractive) from the AMI Meeting Corpus
  • Converts them into the CNN/DailyMail news dataset format (.story files containing the article and its highlights)

Contents

  • Requirements
  • About AMI Meeting Corpus
  • AMI DialSum Corpus
  • How to Use
  • How to Cite

Requirements

Tested on Python 3.6+, Ubuntu 16.04, Mac OS

pip install nltk

AMI Meeting Corpus

  • More info
  • Number of meetings (including scenario and non-scenario): 171
    • Number of speakers per meeting: 4-5
    • Total number of transcripts: 687
  • Number of summaries: 142 (abstractive) and 137 (extractive)
    • Only available for meetings with names starting with ES, IS and TS

How to Use

Download AMI Corpus and extract .story files

python main_obtain_meeting2summary_data.py --summary_type abstractive

A ready-made .story dataset is provided under data/ami-transcripts-stories/
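For readers who have not worked with the CNN/DailyMail layout before, the sketch below shows one way to read a .story file back into its article and highlights. It assumes the usual convention that each highlight follows a line containing only `@highlight`; the function name `parse_story` and the sample text are illustrative and are not part of this repository.

```python
def parse_story(text):
    """Split a CNN/DailyMail-style .story string into (article, highlights)."""
    article_lines = []
    highlights = []
    next_is_highlight = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines only separate blocks
        if line == "@highlight":
            next_is_highlight = True  # the next non-empty line is a highlight
        elif next_is_highlight:
            highlights.append(line)
            next_is_highlight = False
        else:
            article_lines.append(line)
    return " ".join(article_lines), highlights


# Hypothetical sample, not taken from the corpus.
sample = (
    "Okay, let's get started.\n"
    "So the remote should be simple.\n"
    "\n"
    "@highlight\n"
    "\n"
    "The team discussed the remote design.\n"
)
article, highlights = parse_story(sample)
print(article)
print(highlights)
```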

Configuration options

Argument                          Type    Default
summary_type                      string  "abstractive"
ami_xml_dir                       string  "data/"
results_transcripts_speaker_dir   string  "data/ami-transcripts-speaker/"
results_transcripts_dir           string  "data/ami-transcript/"
results_summary_dir               string  "data/ami-summary/"

  • summary_type is the type of summary to be extracted. Options=["abstractive", "extractive"].
  • ami_xml_dir is the directory where the AMI Corpus will be downloaded
  • results_transcripts_speaker_dir is the directory where each speaker's transcript will be saved
  • results_transcripts_dir is the directory where each meeting's transcript will be saved
  • results_summary_dir is the directory where each meeting's summary will be saved
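The options above map naturally onto an `argparse` parser. The following is a minimal sketch of such a parser, using the argument names and defaults exactly as listed in the table; it is an assumption about how the script could be wired, not a copy of main_obtain_meeting2summary_data.py, and the helper name `build_parser` is hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (sketch, not the repo's code)."""
    parser = argparse.ArgumentParser(
        description="Extract AMI meeting transcripts and summaries")
    parser.add_argument("--summary_type", type=str, default="abstractive",
                        choices=["abstractive", "extractive"],
                        help="Type of summary to extract")
    parser.add_argument("--ami_xml_dir", type=str, default="data/",
                        help="Directory where the AMI Corpus is downloaded")
    parser.add_argument("--results_transcripts_speaker_dir", type=str,
                        default="data/ami-transcripts-speaker/",
                        help="Directory for per-speaker transcripts")
    parser.add_argument("--results_transcripts_dir", type=str,
                        default="data/ami-transcript/",
                        help="Directory for per-meeting transcripts")
    parser.add_argument("--results_summary_dir", type=str,
                        default="data/ami-summary/",
                        help="Directory for per-meeting summaries")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--summary_type", "extractive"])
print(args.summary_type, args.ami_xml_dir)
```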

Code explanation

  1. Obtain transcripts

    • Transcripts are originally saved in data/ami_public_manual_1.6.2/words/*.xml
    • Example: EN2001a.A.words.xml
      • Meeting name: EN2001
      • Meeting part: a (meetings are divided into 1-hour parts, each labeled with a consecutive lowercase letter)
      • Speaker: A (usually there are four speakers named A, B, C and D, but E is sometimes also present)
    • How:
      • Each .xml file has a number of tags with the words and their respective times in the audio/video file.
      • To extract the transcripts, those words have to be put back together into sentences and paragraphs.
      • Thus, XML parsing is required.
    • Output: 2 folders with corresponding .txt files
      • data/ami-transcripts-speaker/: meeting transcripts for each speaker
      • data/ami-transcripts/: complete meeting transcripts (all speakers together)
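The step above can be sketched with the standard library's `xml.etree.ElementTree`. The sample below is a hypothetical miniature of a words file: it assumes each word sits in a `w` tag and that punctuation tokens carry `punc="true"`, which may not cover every tag found in the real corpus. The helper names `parse_words_filename` and `words_to_text` are illustrative, not the repository's functions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature of a *.words.xml file; real files contain more tags.
SAMPLE_WORDS_XML = """
<nite:root nite:id="EN2001a.A.words" xmlns:nite="http://nite.sourceforge.net/">
  <w nite:id="EN2001a.A.words0" starttime="1.00" endtime="1.30">Okay</w>
  <w nite:id="EN2001a.A.words1" starttime="1.30" endtime="1.30" punc="true">,</w>
  <w nite:id="EN2001a.A.words2" starttime="1.40" endtime="1.70">let's</w>
  <w nite:id="EN2001a.A.words3" starttime="1.70" endtime="2.10">start</w>
  <w nite:id="EN2001a.A.words4" starttime="2.10" endtime="2.10" punc="true">.</w>
</nite:root>
"""


def parse_words_filename(filename):
    """Split e.g. 'EN2001a.A.words.xml' into (meeting, part, speaker)."""
    match = re.match(r"^([A-Z]{2}\d{4})([a-z]?)\.([A-Z])\.words\.xml$", filename)
    if match is None:
        raise ValueError("Unexpected file name: " + filename)
    return match.group(1), match.group(2), match.group(3)


def words_to_text(xml_string):
    """Join the text of all <w> tags, attaching punctuation to the previous word."""
    root = ET.fromstring(xml_string.strip())
    pieces = []
    for w in root.iter("w"):
        token = (w.text or "").strip()
        if not token:
            continue
        if w.get("punc") == "true" and pieces:
            pieces[-1] += token  # no space before punctuation
        else:
            pieces.append(token)
    return " ".join(pieces)


print(parse_words_filename("EN2001a.A.words.xml"))
print(words_to_text(SAMPLE_WORDS_XML))
```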
  2. Obtain abstractive summaries

    • Located in data/ami_public_manual_1.6.2/abstractive/*.xml
    • How:
      • Extract the text inside the abstract tag
      • The text inside the abstract tag is made up of the text in its sentence tags
      • Return the text of all these sentence tags as a single paragraph
    • Output: data/ami-summary/abstractive/
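As a rough illustration of the abstractive step, the sketch below collects the sentence tags found under the abstract tag and joins them into one paragraph. The sample XML is a hypothetical miniature (the `actions` section is included only to show that sentences outside `abstract` are ignored), and `abstract_paragraph` is an illustrative name rather than a function from this repository.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of an abstractive summary file.
SAMPLE_ABSTRACT_XML = """
<nite:root xmlns:nite="http://nite.sourceforge.net/">
  <abstract>
    <sentence nite:id="s.1">The project manager opened the meeting.</sentence>
    <sentence nite:id="s.2">The team discussed the remote control design.</sentence>
  </abstract>
  <actions>
    <sentence nite:id="s.3">The designer will prepare a prototype.</sentence>
  </actions>
</nite:root>
"""


def abstract_paragraph(xml_string):
    """Return the sentences under <abstract> joined into a single paragraph."""
    root = ET.fromstring(xml_string.strip())
    abstract = next(root.iter("abstract"), None)
    if abstract is None:
        return ""
    # Normalize internal whitespace of each sentence before joining.
    sentences = [" ".join((s.text or "").split()) for s in abstract.iter("sentence")]
    return " ".join(s for s in sentences if s)


print(abstract_paragraph(SAMPLE_ABSTRACT_XML))
```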
  3. Obtain extractive summaries

    • Located in data/ami_public_manual_1.6.2/extractive/*.xml
    • How:
      • Extract the child nodes inside the extsumm tag
      • The extsumm tag is made up of child nodes such as the examples below:
        • Example 1: <nite:child href="ES2002a.B.dialog-act.xml#id(ES2002a.B.dialog-act.dharshi.3)"/>
          • This node refers to the node with ID ES2002a.B.dialog-act.dharshi.3 in the file ES2002a.B.dialog-act.xml in data/dialogueActs/
          • The node with this ID is:
            <dact nite:id="ES2002a.B.dialog-act.dharshi.3" reflexivity="true">
                <nite:pointer role="da-aspect"  href="da-types.xml#id(ami_da_4)"/>
                <nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words4)..id(ES2002a.B.words16)"/>
            </dact>
            
          • This indicates words 4 to 16 in ES2002a.B.words.xml in data/words/
        • Example 2: <nite:child href="ES2002a.D.dialog-act.xml#id(ES2002a.D.dialog-act.dharshi.16)..id(ES2002a.D.dialog-act.dharshi.20)"/>
          • This node refers to the nodes with IDs ES2002a.D.dialog-act.dharshi.16 through ES2002a.D.dialog-act.dharshi.20 in the file ES2002a.D.dialog-act.xml in data/dialogueActs/
      • Return all the collected words as a paragraph
    • Output: data/ami-summary/extractive/
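The href values in the examples above follow a `file#id(start)..id(end)` pattern, where the `..id(end)` part is optional. The sketch below parses such a reference and expands a range into individual IDs. It assumes that the IDs in a range differ only in a trailing integer that increases by one, which matches the examples shown here but is not guaranteed for every file; `parse_href` and `expand_ids` are hypothetical helper names.

```python
import re

HREF_PATTERN = re.compile(
    r"^(?P<file>[^#]+)#id\((?P<start>[^)]+)\)(?:\.\.id\((?P<end>[^)]+)\))?$")


def parse_href(href):
    """Return (file_name, start_id, end_id); end_id equals start_id for single nodes."""
    match = HREF_PATTERN.match(href)
    if match is None:
        raise ValueError("Unexpected href: " + href)
    start = match.group("start")
    end = match.group("end") or start
    return match.group("file"), start, end


def expand_ids(start_id, end_id):
    """List every ID from start_id to end_id, assuming a shared prefix and numeric suffix."""
    start = re.match(r"^(.*?)(\d+)$", start_id)
    end = re.match(r"^(.*?)(\d+)$", end_id)
    if start is None or end is None or start.group(1) != end.group(1):
        raise ValueError("IDs do not share a prefix with a numeric suffix")
    prefix = start.group(1)
    return [prefix + str(i) for i in range(int(start.group(2)), int(end.group(2)) + 1)]


file_name, first, last = parse_href(
    "ES2002a.B.words.xml#id(ES2002a.B.words4)..id(ES2002a.B.words16)")
word_ids = expand_ids(first, last)
print(file_name, len(word_ids), word_ids[0], word_ids[-1])
```

The resulting IDs can then be looked up in the referenced words file to collect the text of the extractive summary.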

Extra: AMI DialSum Meeting Corpus

  • DialSum: modified version of the AMI Meeting Dataset
  • Use script ami_dialsum_meeting_story.py:
    • This script takes 2 text files (in and sum) and formats them into a series of .story files compatible with the CNN/DM format
    • Each line in the file in is a meeting transcript, and its summary is on the same line of the file sum.
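The line-by-line pairing described above can be sketched as follows. This is not the code of ami_dialsum_meeting_story.py: the function name `write_stories`, the output file naming (`meeting_0000.story`), and the sample lines are assumptions made for illustration, and each story is written with a single `@highlight` block.

```python
import os
import tempfile


def write_stories(in_path, sum_path, out_dir):
    """Pair line i of in_path with line i of sum_path and write one .story file each."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(in_path, encoding="utf-8") as f_in, \
            open(sum_path, encoding="utf-8") as f_sum:
        for idx, (transcript, summary) in enumerate(zip(f_in, f_sum)):
            story = transcript.strip() + "\n\n@highlight\n\n" + summary.strip() + "\n"
            out_path = os.path.join(out_dir, "meeting_{:04d}.story".format(idx))
            with open(out_path, "w", encoding="utf-8") as f_out:
                f_out.write(story)
            written.append(out_path)
    return written


# Demonstration with throwaway files; the contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    in_file = os.path.join(tmp, "in")
    sum_file = os.path.join(tmp, "sum")
    with open(in_file, "w", encoding="utf-8") as f:
        f.write("hello there\nlet us talk about the budget\n")
    with open(sum_file, "w", encoding="utf-8") as f:
        f.write("A greeting.\nThe budget was discussed.\n")
    paths = write_stories(in_file, sum_file, os.path.join(tmp, "stories"))
    with open(paths[0], encoding="utf-8") as f:
        first_story = f.read()

print(len(paths))
print(first_story)
```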

Notes

  • XML reader in Python:

  • TODO

    • Overlapping meeting transcript
    • Decision abstract

Acknowledgement

Please star or fork if this code was useful for you. If you use it in a paper, please cite as:

@software{cunha_sergio2019ami_xml2story,
    author       = {Gwenaelle Cunha Sergio},
    title        = {{gcunhase/AMICorpusXML: Obtaining Transcript and Abstractive and Extractive Summaries from the AMI Meeting Corpus and formatting the AMI DialSum Meeting Corpus}},
    month        = dec,
    year         = 2019,
    doi          = {10.5281/zenodo.3561298},
    version      = {v2.1},
    publisher    = {Zenodo},
    url          = {https://github.com/gcunhase/AMICorpusXML}
    }

If you use the AMI Meeting Corpus, please also add the following citation:

@INPROCEEDINGS{Mccowan05theami,
    author = {I. Mccowan and G. Lathoud and M. Lincoln and A. Lisowska and W. Post and D. Reidsma and P. Wellner},
    title = {The AMI Meeting Corpus},
    booktitle = {In: Proceedings Measuring Behavior 2005, 5th International Conference on Methods and Techniques in Behavioral Research. L.P.J.J. Noldus, F. Grieco, L.W.S. Loijens and P.H. Zimmerman (Eds.), Wageningen: Noldus Information Technology},
    year = {2005}
}

amicorpusxml's Issues

Custom transcript summarization

How can I use this code to create a summary for a new raw transcript containing dialogue from multiple users? Are any major changes needed?

A question about data format

Hello. First of all thank you for this wonderful library.

I have a question about the transcription format.

For each .txt file in data/ami-transcripts, does each line denote a single speaker?

For example, would each line of EN2001a belong to a separate speaker, of which there are 5?

Thank you!

Train/Val/Test split

Is there any standard for the train/val/test split for abstract summarization?

Various sentences by different speakers interaligned

Hi,

From what I understand, the final transcripts that you create for every individual meeting contain all the sentences spoken by each speaker individually. But there is no alignment between them, which means they cannot be turned into an actual dialogue/conversation.

For example I want the following conversation,

Speaker A : Hi
Speaker B : Hello. What are you doing here?
Speaker A : The supervisor called me in. He wanted to have a word.
Speaker B : Well good luck. He is in a mood today.

However, the final transcripts provided in your repo are in the following format:
Speaker A : Hi. The supervisor called me in. He wanted to have a word.
Speaker B : Hello. What are you doing here? Well good luck. He is in a mood today.

Can you help me create the dialogue format that I require, or point me to the files in the AMI corpus that I can use to obtain it?

Thank You

thesis

Hi,

I am very interested in your repo. Do you have a thesis for this?

Thanks!
