qcri / arabic_speech_code_switching

The first Dialectal Arabic Code-Switching (DACS) corpus from broadcast speech, annotated at the token level using both linguistic and acoustic cues. The dataset is a potential benchmark for dialectal code-switching (DCS) in spontaneous speech.

License: MIT License

codeswitching arabic dialect-identification mordern-standard-arabic egyptian asr evaluation acoustic lexical

arabic_speech_code_switching's Introduction

Dialectal Arabic Code-Switching Dataset.

This release includes the annotated two-hour Egyptian dataset from the ADI-5 development split of the MGB-3 challenge [1]. The released MGB-3 data includes speech features and textual features extracted from ASR transcription.

Unlike MGB-3:EGY, the audio in this dataset was manually segmented into smaller utterances (split at silences of 500 ms or more), and the speech was transcribed verbatim by a lay native Egyptian speaker.

The transcribed data was then annotated for word-level Code-Switching (CS) information by 3 annotators. Following the guidelines described in the paper, the annotators classified each word into one of four categories: (i) MSA: MSA word with MSA pronunciation; (ii) EGY: Egyptian word; (iii) MIX: MSA word with dialectal pronunciation; and (iv) FRN: foreign word, i.e., not Arabic. In addition, a 'NULL' tag was assigned when a word is unintelligible or cannot be assigned to any of the four categories.
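Because each word carries three independent annotations, downstream code usually needs a single label per word. The README does not say how the three labels should be combined, so the following is only one plausible sketch: a strict majority vote. The function name `majority_label` is hypothetical, and the code assumes the labels appear in the data spelled exactly as in the list above.

```python
from collections import Counter
from typing import Optional, Sequence

# Tag set as listed above; "NULL" marks unintelligible or uncategorisable words.
# Assumption: the files use exactly these spellings.
TAGS = {"MSA", "EGY", "MIX", "FRN", "NULL"}


def majority_label(labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by more than half of the annotators, else None.

    Majority voting is an illustrative choice, not a rule taken from the dataset.
    """
    unknown = set(labels) - TAGS
    if unknown:
        raise ValueError(f"unexpected label(s): {sorted(unknown)}")
    label, count = Counter(labels).most_common(1)[0]
    # A strict majority is required; a three-way disagreement yields None.
    return label if count * 2 > len(labels) else None
```

Returning `None` on full disagreement keeps such words visible to the caller, who can then drop them or handle them separately.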

More details are in the paper:

@inproceedings{chowdhury2020cs,
  title={Effects of Dialectal Code-Switching on Speech Modules: A Study using Egyptian Arabic Broadcast Speech},
  author={Chowdhury, Shammur Absar  and Samih, Younes and Eldesouki, Mohamed and Ali, Ahmed},
  booktitle={INTERSPEECH},
  year={2020}
}

also available in Paper

Data Format

DACS_word_level.feat: The input file, containing the words and their corresponding labels. It has the following space-separated fields: #id word_index_in_sentence word word_start word_duration word_end label1 label2 label3

where

#id: the corresponding wav id.

word_index_in_sentence: the position of the word in the utterance.

word: the manually transcribed word (in Buckwalter transliteration).

word_start: start time of the word in seconds.

word_duration: duration of the word in seconds.

word_end: end time of the word in seconds.

phone phone_conf phone_start phone_duration phone_end: the corresponding information for each phone (from forced alignment).

label[1-3]: the annotation label provided by annotator [1-3].
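The field layout above can be read with a small parser. The sketch below is not part of the release: `WordRecord`, `parse_word_line` and `read_word_file` are hypothetical names. It assumes whitespace-separated fields, an integer word index, that the three annotator labels are the last three fields, and that a line starting with `#` is a header. Since the phone-level fields are described but not shown in the field list, any fields found between word_end and the labels are kept untouched in `extra`.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class WordRecord:
    wav_id: str
    index: int                     # position of the word in the utterance (assumed integer)
    word: str                      # Buckwalter transliteration
    start: float                   # seconds
    duration: float                # seconds
    end: float                     # seconds
    extra: List[str]               # any fields between word_end and the labels
    labels: Tuple[str, str, str]   # labels from annotators 1-3


def parse_word_line(line: str) -> WordRecord:
    """Parse one whitespace-separated line of the word-level file."""
    parts = line.split()
    if len(parts) < 9:
        raise ValueError(f"expected at least 9 fields, got {len(parts)}")
    return WordRecord(
        wav_id=parts[0],
        index=int(parts[1]),
        word=parts[2],
        start=float(parts[3]),
        duration=float(parts[4]),
        end=float(parts[5]),
        extra=parts[6:-3],
        labels=(parts[-3], parts[-2], parts[-1]),
    )


def read_word_file(path: str) -> Iterator[WordRecord]:
    """Yield one record per data line, skipping blank lines and '#' header lines."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue
            yield parse_word_line(line)
```

For example, `list(read_word_file("DACS_word_level.feat"))` would load every word, after which the records can be grouped by `wav_id` to rebuild utterances.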

segments_dacs: This file describes how the MGB-3:EGY audio was manually segmented into utterances. It has the following fields: segmented_id audio_id segment_start segment_end

where

segmented_id: the wav id of the utterance.

audio_id: the id of the original audio file from MGB-3:EGY.

segment_start/end: the start and end times of the segmented utterance.

mgb3_audio_list.txt: A list of the audio files (MGB-3:EGY) used for this dataset. Each file can be downloaded directly from its URL.
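The segment file can be parsed in a similar way. This is a minimal sketch under stated assumptions: the names `Segment` and `parse_segment_line` are hypothetical, the four fields are assumed to be whitespace-separated, and the times are assumed to be in seconds (the README states the unit only for the word-level file).

```python
from typing import NamedTuple


class Segment(NamedTuple):
    segment_id: str   # wav id of the utterance
    audio_id: str     # id of the original MGB-3:EGY audio file
    start: float      # assumed to be seconds from the start of the original audio
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def parse_segment_line(line: str) -> Segment:
    """Parse one line: segmented_id audio_id segment_start segment_end."""
    parts = line.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    segment_id, audio_id, start, end = parts
    return Segment(segment_id, audio_id, float(start), float(end))
```

With the segments loaded, `start` and `end` give the offsets needed to cut each utterance out of its original audio file.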

[1] Ali, Ahmed, Stephan Vogel, and Steve Renals. "Speech recognition challenge in the wild: Arabic MGB-3." 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017.
