conversation_mining's Introduction

Conversation mining

Conversational data

Publicly available datasets with conversation transcripts annotated with dialog (speech) acts:

The Dialog State Tracking Challenge (DSTC) series provided several datasets of annotated information-seeking dialog transcripts in the travel and restaurant domains. Some of them are freely available. The datasets were created to evaluate and compare the performance of dialog state trackers, i.e. systems that interpret the user's actions. Each dataset also includes an ontology describing the domain, which consists of attributes (slots) and the set of possible values for each attribute. The transcripts are annotated with dialog acts, user goals, methods, attributes and timestamps, as well as user feedback.

  • DSTC1: the domain is route information for buses in Pittsburgh. Codebook. License: MSR-LA.

  • DSTC2: labeled human-computer dialogs in the restaurant information domain, distributed in JSON format. The domain of the dataset is described by an ontology object, also distributed in JSON. Phoenix grammar. The dialog-act notation closely matches that used in DSTC1.

  • The Switchboard Dialog Act Corpus (SwDA) extends the Switchboard-1 Telephone Speech Corpus, Release 2 with turn/utterance-level dialog-act tags. The dataset contains transcripts of telephone conversations annotated with 43 dialog-act tags, part-of-speech tags, lemmas and parse trees. Description. Codebook. License: GNU GPL v2.0.

  • The Spoken Conversational Search (SCS) Data Set provides transcripts collected for pre-defined search tasks performed in a speech-only conversational setting. The transcripts are annotated with timestamps, the corresponding search queries and the dialog acts for each of the roles. Codebook.

  • The Open Data Exploration dataset for the conversational browsing task contains 26 transcripts annotated with dialog acts and entity spans. Codebook. License: MIT.

Conversation Logs

Format: CSV, for importing into ProM. One message (dialog act) per row.

Basic columns:

  • case ID - conversation identifier
  • resource - actor role of the conversation participant
  • activity name - dialog (speech) act

Optional columns:

  • start time, stop time - timestamps that reflect the order of messages along the time axis
  • message count - counts the number of messages exchanged within a conversation
  • message - transcript of the utterance
  • query - information need describing the task (instruction) that participants are solving

DSTC1 & 2:

  • turn count - counts the pairs of messages exchanged within a conversation
  • slots - message attributes from the domain ontology

SCS:

  • Query.complexity - one of three levels referring to the task complexity type (remember, understand, and analyze)
  • Notes - comments, e.g. that a particular search was stopped by the user or the researcher, or extra notes on the participant's actions during the search session

SwDA:

  • length - duration of the conversation in seconds
  • caller_dialect_area - geographic identifier for the cluster of resources, one of {MIXED, NEW ENGLAND, NORTH MIDLAND, NORTHERN, NYC, SOUTH MIDLAND, SOUTHERN, UNK, WESTERN}
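
The row layout described above can be sketched in a few lines of Python. This is only an illustration, not the conversion script used for the datasets: the header names (`case_id`, `resource`, `activity`, `start_time`, `message`), the dialog-act labels and the example conversation are all made up, and the CSV importer is assumed to let you map whichever header names you choose to case, resource and activity.

```python
import csv
import io

# Hypothetical header names; the column list above names them only informally.
FIELDNAMES = ["case_id", "resource", "activity", "start_time", "message"]


def write_conversation_log(rows, out):
    """Write one message (dialog act) per row as CSV to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        # Missing optional columns are written as empty strings.
        writer.writerow(row)


# Made-up conversation, for illustration only.
example_rows = [
    {"case_id": "c1", "resource": "user", "activity": "request",
     "start_time": "2018-01-01T10:00:00", "message": "I need a cheap restaurant"},
    {"case_id": "c1", "resource": "system", "activity": "inform",
     "start_time": "2018-01-01T10:00:05", "message": "There are three options"},
    {"case_id": "c1", "resource": "user", "activity": "bye"},
]

buffer = io.StringIO()
write_conversation_log(example_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Dataset-specific columns such as slots or Query.complexity could be added by extending the hypothetical `FIELDNAMES` list.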

Annotations

Conducted by two annotators.

Annotation schema: Krippendorff's alpha 0.997

Dialogue success evaluation: Krippendorff's alpha 0.726
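
For readers unfamiliar with the measure, here is a minimal sketch of Krippendorff's alpha. The text above does not say which variant or tool produced the reported values, so this assumes the simplest case: nominal labels, exactly two annotators, and no missing annotations. The function name and the example labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(labels_a, labels_b):
    """Krippendorff's alpha for two annotators, nominal labels, no missing data."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same units")

    # Coincidence matrix: each unit contributes both ordered pairs of its labels.
    coincidences = Counter()
    for a, b in zip(labels_a, labels_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1

    # Marginal totals per label and the total number of pairable values.
    totals = Counter()
    for (c, _k), count in coincidences.items():
        totals[c] += count
    n = sum(totals.values())

    # Observed and expected disagreement (unnormalized), off-diagonal cells only.
    observed = sum(count for (c, k), count in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c, k in permutations(totals, 2))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")

    return 1.0 - (n - 1) * observed / expected


# Made-up labels: the annotators disagree on one of four units.
annotator_1 = ["request", "inform", "request", "inform"]
annotator_2 = ["request", "inform", "inform", "inform"]
alpha = krippendorff_alpha_nominal(annotator_1, annotator_2)
print(round(alpha, 3))
```

An alpha of 1 means perfect agreement, and 0 means agreement no better than chance.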
