

Darija Open Dataset

Welcome to the Darija Open Dataset (DODa), an ambitious open-source project dedicated to the Moroccan dialect. With about 150,000 entries, DODa is arguably the largest open-source collaborative project for Darija <=> English translation built for Natural Language Processing purposes.

In fact, besides semantic categorization, DODa also adopts a syntactic one: it presents words under different spellings, offers verb-to-noun and masculine-to-feminine correspondences, contains the conjugations of hundreds of verbs in different tenses, and includes more than 86,000 translated sentences.

Additionally, DODa takes into account the diversity of Darija spellings used in various contexts, making it a versatile resource for language enthusiasts and NLP practitioners. The dataset includes entries written in both Latin and Arabic alphabets, reflecting the linguistic variations and preferences found in different sources and applications.

Our primary goal is to establish DODa as the go-to reference for NLP in Darija. By providing a robust and diverse dataset, we aim to facilitate the development of NLP applications that can cater to the specific linguistic needs of the Moroccan community.

While we have made significant progress in compiling and organizing the dataset, it's important to note that parts of the dataset are still either under review or in progress, especially in the sentences.csv file. We welcome contributions from the Moroccan IT community to help us refine and expand the dataset further, ensuring its accuracy and completeness. Together, we can build a powerful foundation for future NLP innovations tailored to Moroccan culture and language.
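Since the dataset ships as plain CSV files, it can be explored directly with pandas. A minimal sketch (the column names below are assumptions for illustration, not the dataset's actual headers):

```python
import io
import pandas as pd

# Illustrative rows in the spirit of DODa's word files; the column
# names here are placeholders for the sketch, not the real headers.
sample = io.StringIO(
    "n1,n2,n3,n4,darija_ar,eng\n"
    '"sou9","souk","souq","","سوق","market"\n'
    '"7mam","hmam","","","حمام","pigeons"\n'
)

df = pd.read_csv(sample)
print(df.shape)          # (2, 6)
print(df.loc[0, "eng"])  # market
```

In practice you would pass the path of one of the repository's CSV files to `pd.read_csv` instead of the in-memory sample.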


How to contribute

You're free to navigate straight to the AtlasIA interface and start your contributions 🔥🔥.

Otherwise, if you prefer using dev tools, we've made a detailed video on how to contribute.

TL;DW (Too Long Didn't Watch):

  1. Go to Issues
  2. Choose one and comment to have it assigned to you
  3. Fork the Dataset Repository
  4. Translate and fix typos in the file corresponding to your assigned issue
  5. Open a Pull Request

Thank you for your contribution!!!

Guidelines / Recommendations

  1. 3ndk ح dir ح xD (roughly: "don't you dare write ح"; shout-out to this guy 😆). For these sounds, often use:

     darija:  3   7   9   8   2 / 'a' / 'i'    5 / 'kh'
     arabic:  ع   ح   ق   ه   همزة (hamza)     خ

  2. Use capitalization to differentiate between the following letters:

     latin:   t   T   s   S   d   D
     arabic:  ت   ط   س   ص   د   ض

  3. Arabic characters with a two-letter Latin equivalent:

     arabic:  ش    غ    خ
     latin:   ch   gh   kh

  4. Double a character to mark emphasis ("الشدة", shadda):

     darija:   7mam      7mmam
     english:  pigeons   bathroom
  5. We usually don't add a final "e" to Darija words: louz instead of louze.

  6. We usually don't use "Z" or "th" for ظ، ذ، ث, because these letters are generally absent from Darija (except in northern Morocco, but for the sake of simplicity we focus primarily on standard Darija).

  7. When a field contains commas, don't forget to surround it with quotation marks (as we are using CSV files).

  8. We use spaces as word delimiters, not _ nor -: thank you instead of thank_you.

  9. Respect the number of columns in every row you add; use empty quotation marks "" or just an empty placeholder when you don't have extra variations:

"sou9","souk","","سوق","market"

sou9,souk,,مارشي,market
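The comma-quoting and fixed-column-count rules can be enforced mechanically when generating rows. A minimal sketch using Python's csv module (`QUOTE_ALL` quotes every field, so embedded commas stay safe; the 5-column layout here simply matches the "sou9" example row above):

```python
import csv
import io

EXPECTED_COLUMNS = 5  # matches the example row above; other files may differ

rows = [
    ["sou9", "souk", "", "سوق", "market"],
    # this row contains commas, which QUOTE_ALL makes safe:
    ["chhal hada, 3afak", "ch7al hada, 3afak", "", "شحال هدا عفاك", "how much is this, please"],
]

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
for row in rows:
    assert len(row) == EXPECTED_COLUMNS, "row has wrong number of columns"
    writer.writerow(row)

print(buf.getvalue().splitlines()[0])
# "sou9","souk","","سوق","market"
```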

  10. In each row, start with the most used form of the word in question (in your opinion, of course).

  11. For future use of this dataset, try to reserve each row for spelling variations of the same word. For instance, "sou9" and "marchi" both translate to "market", yet it's better to separate them into two different rows:

sou9,souk,souq,سوق,market

marchi,,,مارشي,market

  12. verbs.csv: the Darija translation is reserved for the past tense of the third-person pronoun "he", whereas the other pronouns and tenses are handled in separate files. The English translation presents the basic form (or root) of the English verb:

ghnna,ghenna,ghanna,,,,غنّا,sing

  13. masculine_feminine_plural.csv: where it exists, the feminine-plural column applies to nouns; for adjectives, the feminine-plural form is identical to the feminine form.
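Some of the spelling guidelines above are mechanical enough to lint automatically before opening a pull request. A heuristic sketch (it will produce false positives, e.g. a loanword that genuinely ends in an "e" sound, or an English field containing "th"):

```python
def guideline_warnings(entry):
    """Flag mechanical violations of the spelling guidelines above.

    Heuristic only: intended for Darija Latin-script fields, and prone
    to false positives on legitimate spellings.
    """
    warnings = []
    if entry.endswith("e"):
        warnings.append("drop the final 'e' (louz, not louze)")
    if "_" in entry or "-" in entry:
        warnings.append("use spaces as word delimiters, not '_' or '-'")
    if "th" in entry:
        warnings.append("avoid 'th': standard Darija does not use ث/ذ/ظ")
    return warnings

print(guideline_warnings("louze"))      # one warning about the final 'e'
print(guideline_warnings("sou9_kbir"))  # one warning about the delimiter
print(guideline_warnings("louz"))       # []
```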

PyDODa - Python wrapper for the DODa

Python Badge

Pydoda is a comprehensive Python library that simplifies access to and analysis of the DODa dataset. It enables effortless exploration of linguistic content for researchers, developers, and language enthusiasts by providing an intuitive interface for accessing the various dataset categories and retrieving spellings and translations.

Integrating Pydoda into your Python workflow grants access to a wide range of functionalities, facilitating insights extraction from the DODa dataset, including semantic and syntactic analysis, translation retrieval, spelling variations exploration, and more.

Usage example

Pydoda can be installed with pip:

pip install pydoda

Here is a small code snippet:

from pydoda import Category

# Create an instance of Category
my_category = Category('semantic', 'animals')

# Get the Darija translation of a word
darija_translation = my_category.get_darija_translation('dog')
print(darija_translation)
# Output: klb

# Get the English translation of a word
english_translation = my_category.get_english_translation('mch')
print(english_translation)
# Output: cat

For further details, visit the official Pydoda GitHub repository & official Pydoda documentation.

Usage Terms

  • Research and Personal Use: You are welcome to use DODa for research, personal projects, and educational purposes, free of charge, in accordance with the terms of the open-source license.

  • Commercial Use: For commercial purposes or any other usage not covered by the open-source license, please contact the copyright holders Aissam or Hamza to discuss licensing options and permissions.

Citation

@misc{outchakoucht2024evolution,
      title={The Evolution of Darija Open Dataset: Introducing Version 2}, 
      author={Aissam Outchakoucht and Hamza Es-Samaali},
      year={2024},
      eprint={2405.13016},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{outchakoucht2021moroccan,
      title={Moroccan Dialect -Darija- Open Dataset},
      author={Aissam Outchakoucht and Hamza Es-Samaali},
      year={2021},
      eprint={2103.09687},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributors

ahkecha, ai-sam, aissam-out, anasselhoud, asmachkirida, boogey9, darija-enjoyer, darija-open-dataset, deepesh611, f-amine, femaleprog, haoes, imadsaddik, issamll, kinothe-kafkaesque, kuuhaakuu1, locutus2017, maghwa, moun3imy, saad-out, shinwi, teleyosh, zakarm, zouhair-isk


Issues

Hosting the dataset on DataStack?

Hello!

I enjoy seeing community data projects like this. As a former student of Cantonese (a dialect of Chinese spoken in Hong Kong), I know what it's like when the language you're interested in has no dictionary. At one point I tried to create one myself, but gave up after 500 words. It's too much effort to do alone.

The issue today is that collaborating on these things is very technical. I'm sure your contributors here are familiar with GitHub and have no problem helping out, but it would be great if anyone who just knows English and Darija could help out, without technical skills.

For this reason I am building DataStack. It's a collaboration platform for tabular data that works similarly to GitHub, but is much easier to use for data. No tools or technical knowledge needed.

To show you what it looks like, I've uploaded two files from your dataset there.

If you're interested, please have a look. Does this solution suit your needs? What is still missing, in your opinion?

typos in sentences.csv

I've been looking through sentences.csv and found some typos, such as "8" being used for "h" or "7". There are also sentences that could have variant translations, for example "kaychrab kass dyal lma" versus "kaychrb kass dlma". In addition, "x" is sometimes used instead of "ch".

sentences2.csv doesn't work (makhdamch)

I wanted to load this dataset into a dataframe, but I get the following error:

ParserError Traceback (most recent call last)
in
1 import pandas as pd
2
----> 3 print(pd.read_csv("dataset/sentences2.csv"))

~\anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
674 )
675
--> 676 return _read(filepath_or_buffer, kwds)
677
678 parser_f.name = name

~\anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
452
453 try:
--> 454 data = parser.read(nrows)
455 finally:
456 parser.close()

~\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
1131 def read(self, nrows=None):
1132 nrows = _validate_integer("nrows", nrows)
-> 1133 ret = self._engine.read(nrows)
1134
1135 # May alter columns / col_dict

~\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
2035 def read(self, nrows=None):
2036 try:
-> 2037 data = self._reader.read(nrows)
2038 except StopIteration:
2039 if self._first_chunk:

pandas_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 2 fields in line 1865, saw 4

And when I set error_bad_lines=False, I get this instead (see screenshot "Capture d’écran 2021-05-07 162904").
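The ParserError above is the comma-quoting issue from the guidelines: an unquoted comma inside a field makes the parser see extra columns. Note also that in recent pandas (≥ 1.3), `error_bad_lines` was deprecated in favor of `on_bad_lines`. A minimal reproduction and workaround (the real fix is to quote the offending field in the CSV itself):

```python
import io
import pandas as pd

# A malformed snippet like the one behind the ParserError: the third
# data line has an unquoted comma, so the parser sees 3 fields, not 2.
raw = (
    "darija,eng\n"
    "kayn chi mochkil,there is a problem\n"
    "chhal hada, 3afak,how much is this please\n"
    "la bas,fine\n"
)

# error_bad_lines=False is deprecated; pandas >= 1.3 uses on_bad_lines.
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(len(df))  # 2 rows survive; the malformed line is dropped
```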

Thank you !

Thanks so much for this initiative.
I will try to contribute during my spare time.
