The Fisher and CALLHOME Spanish--English Speech Translation Corpus

The Fisher and CALLHOME Spanish--English Speech Translation Corpus contains English reference translations and speech recognizer output (in various forms) that complement the LDC Fisher and CALLHOME Spanish audio and transcripts. Together, they make a four-way parallel dataset whose goal is to further research in Spanish--English speech translation.

For a complete description of this corpus, see the following paper, and please cite it in your own published research. A copy can be found in the doc/ directory.

@inproceedings{post2013improved,
  Title = {Improved Speech-to-Text Translation with the {F}isher and {C}allhome {S}panish--{E}nglish Speech Translation Corpus},
  Author = {Post, Matt and Kumar, Gaurav and Lopez, Adam and Karakos, Damianos and Callison-Burch, Chris and Khudanpur, Sanjeev},
  Booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)},
  Year = {2013},
  Address = {Heidelberg, Germany},
  Month = {December}
}

The mapping/ directory contains files corresponding to our data splits. Each line in these files contains a reference to the LDC transcript file and line numbers.

The corpus/ directory houses the various pieces of the corpus. Each subdirectory contains (a) a single Spanish side and (b) either one English reference (for Fisher training and all CALLHOME data) or four English references (for the Fisher test sets). The Spanish side always has the extension ".es" and is one of (a) the LDC transcript, (b) Kaldi ASR output, (c) Kaldi lattice output, or (d) lattice oracle paths.

Due to licensing restrictions, we cannot include the LDC Spanish transcripts with this dataset. We have, however, provided scripts that will construct our data splits. To build these, first define the environment variables $LDC2010T04 and $LDC96T17 to point to your LDC2010T04 and LDC96T17 installations, respectively. Then run:

make
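
Before running make, it can help to confirm that both variables are set and point at real directories. The following is a minimal sketch and not part of the data release: the function name check_ldc_env is hypothetical, and the paths in the usage comment are placeholders for wherever your LDC installations live.

```shell
# Hypothetical helper (not shipped with the corpus): verify that the two
# environment variables used by the build step are set and name directories.
check_ldc_env() {
    status=0
    for name in LDC2010T04 LDC96T17; do
        # Indirect lookup of the variable named by $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: \$$name is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "error: \$$name does not point to a directory: $value" >&2
            status=1
        fi
    done
    return "$status"
}

# Example usage (placeholder paths, adjust to your installation):
#   export LDC2010T04=/path/to/LDC2010T04
#   export LDC96T17=/path/to/LDC96T17
#   check_ldc_env && make
```

The check only looks at whether each directory exists; it makes no assumption about the layout inside the LDC releases.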

You can also run the two scripts directly by manually listing the directories, e.g.,

./bin/build_fisher.sh $LDC2010T04
./bin/build_callhome.sh $LDC96T17

Either way, you should end up with the following generated files, in addition to the files included with the data release:

corpus/ldc/fisher_train.es
corpus/ldc/fisher_dev.es
corpus/ldc/fisher_dev2.es
corpus/ldc/fisher_test.es
corpus/ldc/callhome_train.es
corpus/ldc/callhome_devtest.es
corpus/ldc/callhome_evltest.es
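
To confirm that the build produced all seven files listed above, a quick check along these lines may be useful. This is a sketch under stated assumptions: the function name check_generated is hypothetical, and treating an empty file as a failure is a choice made here rather than something the release specifies.

```shell
# Hypothetical helper: report which of the expected generated files are
# missing (or empty) under a checkout root, defaulting to the current directory.
check_generated() {
    root=${1:-.}
    missing=0
    for name in fisher_train fisher_dev fisher_dev2 fisher_test \
                callhome_train callhome_devtest callhome_evltest; do
        path="$root/corpus/ldc/$name.es"
        # -s is true only if the file exists and has a size greater than zero.
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every expected file was found.
    [ "$missing" -eq 0 ]
}

# Example usage, run from the top of the repository after the build:
#   check_generated . && echo "all generated files present"
```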
