This dataset contains subtitles from YouTube videos in Moroccan Darija, a colloquial Arabic dialect spoken in Morocco. The subtitles were collected from several popular Moroccan YouTube channels, providing a diverse set of transcriptions in the Darija language.
Dataset Description
The dataset is provided as a CSV file, where each row represents a YouTube video and contains the following columns:
video_id: The unique identifier of the YouTube video.
title: The title of the YouTube video.
transcript: The transcript of the video in Moroccan Darija without timestamps.
The subtitles cover a wide range of topics, including entertainment, news, history, and more, offering a comprehensive representation of the Moroccan Darija language as used in YouTube content.
Dataset Structure
The dataset is a single CSV file named moroccan_darija_subtitles.csv. The CSV file has the following structure:
video_id,title,transcript
video1_id,Video 1 Title,Video 1 Transcript in Moroccan Darija
video2_id,Video 2 Title,Video 2 Transcript in Moroccan Darija
...
The first row of the CSV file contains the column headers: video_id, title, and transcript.
Each subsequent row represents a YouTube video and its corresponding subtitle information.