

Darija Open Dataset

Welcome to the Darija Open Dataset (DODa), an ambitious open-source project dedicated to the Moroccan dialect. With about 150,000 entries, DODa is arguably the largest open-source collaborative project for Darija <=> English translation built for Natural Language Processing purposes.

In fact, besides semantic categorization, DODa also adopts a syntactic one: it presents words under different spellings, offers verb-to-noun and masculine-to-feminine correspondences, contains the conjugations of hundreds of verbs in different tenses, and includes more than 86,000 translated sentences.

Additionally, DODa takes into account the diversity of Darija spellings used in various contexts, making it a versatile resource for language enthusiasts and NLP practitioners. The dataset includes entries written in both Latin and Arabic alphabets, reflecting the linguistic variations and preferences found in different sources and applications.

Our primary goal is to establish DODa as the go-to reference for NLP in Darija. By providing a robust and diverse dataset, we aim to facilitate the development of NLP applications that can cater to the specific linguistic needs of the Moroccan community.

While we have made significant progress in compiling and organizing the dataset, it's important to note that parts of the dataset are still either under review or in progress, especially in the sentences.csv file. We welcome contributions from the Moroccan IT community to help us refine and expand the dataset further, ensuring its accuracy and completeness. Together, we can build a powerful foundation for future NLP innovations tailored to Moroccan culture and language.
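Since the dataset ships as plain CSV files, it can be explored directly with pandas. A minimal sketch (the column names below are assumptions for illustration, not the dataset's actual headers):

```python
import io
import pandas as pd

# Illustrative rows in the spirit of DODa's word files; the column
# names here are placeholders for the sketch, not the real headers.
sample = io.StringIO(
    "n1,n2,n3,n4,darija_ar,eng\n"
    '"sou9","souk","souq","","سوق","market"\n'
    '"7mam","hmam","","","حمام","pigeons"\n'
)

df = pd.read_csv(sample)
print(df.shape)          # (2, 6)
print(df.loc[0, "eng"])  # market
```

In practice you would pass the path of one of the repository's CSV files to `pd.read_csv` instead of the in-memory sample.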


How to contribute

You're free to navigate straight to the AtlasIA interface and start your contributions 🔥🔥.

Otherwise, if you prefer using dev tools, we've made a detailed video on how to contribute.

TL;DW (Too Long Didn't Watch):

  1. Go to Issues
  2. Choose one and comment to have it assigned to you
  3. Fork the Dataset Repository
  4. Translate and fix typos in the file corresponding to your assigned issue
  5. Open a Pull Request

Thank you for your contribution!!!

Guidelines / Recommendations

  1. 3ndk ح dir ح xD (roughly: "don't you dare write ح"; shout-out to this guy 😆). For these sounds, often use:

     darija:  3   7   9   8   2 / 'a' / 'i'    5 / 'kh'
     arabic:  ع   ح   ق   ه   همزة (hamza)     خ

  2. Use capitalization to differentiate between the following letters:

     latin:   t   T   s   S   d   D
     arabic:  ت   ط   س   ص   د   ض

  3. Arabic characters with a two-letter Latin equivalent:

     arabic:  ش    غ    خ
     latin:   ch   gh   kh

  4. Double a character to mark emphasis ("الشدة", shadda):

     darija:   7mam      7mmam
     english:  pigeons   bathroom
  5. We usually don't add a final "e" to Darija words: louz instead of louze.

  6. We usually don't use "Z" or "th" for ظ، ذ، ث, because these letters are generally absent from Darija (except in northern Morocco, but for the sake of simplicity we focus primarily on standard Darija).

  7. When a field contains commas, don't forget to surround it with quotation marks (as we are using CSV files).

  8. We use spaces as word delimiters, not _ nor -: thank you instead of thank_you.

  9. Respect the number of columns in every row you add; use empty quotation marks "" or just an empty placeholder when you don't have extra variations:

"sou9","souk","","سوق","market"

sou9,souk,,مارشي,market
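The comma-quoting and fixed-column-count rules can be enforced mechanically when generating rows. A minimal sketch using Python's csv module (`QUOTE_ALL` quotes every field, so embedded commas stay safe; the 5-column layout here simply matches the "sou9" example row above):

```python
import csv
import io

EXPECTED_COLUMNS = 5  # matches the example row above; other files may differ

rows = [
    ["sou9", "souk", "", "سوق", "market"],
    # this row contains commas, which QUOTE_ALL makes safe:
    ["chhal hada, 3afak", "ch7al hada, 3afak", "", "شحال هدا عفاك", "how much is this, please"],
]

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
for row in rows:
    assert len(row) == EXPECTED_COLUMNS, "row has wrong number of columns"
    writer.writerow(row)

print(buf.getvalue().splitlines()[0])
# "sou9","souk","","سوق","market"
```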

  10. In each row, start with the most used form of the word in question (in your opinion, of course).

  11. For future use of this dataset, try to reserve each row for spelling variations of the same word. For instance, "sou9" and "marchi" both translate to "market", yet it's better to separate them into two different rows:

sou9,souk,souq,سوق,market

marchi,,,مارشي,market

  12. verbs.csv: the Darija translation is reserved for the past tense of the third-person pronoun "he", whereas the other pronouns and tenses are handled in separate files. The English translation presents the basic form (or root) of the English verb:

ghnna,ghenna,ghanna,,,,غنّا,sing

  13. masculine_feminine_plural.csv: where it exists, the feminine-plural column applies to nouns; for adjectives, the feminine-plural form is identical to the feminine form.
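Some of the spelling guidelines above are mechanical enough to lint automatically before opening a pull request. A heuristic sketch (it will produce false positives, e.g. a loanword that genuinely ends in an "e" sound, or an English field containing "th"):

```python
def guideline_warnings(entry):
    """Flag mechanical violations of the spelling guidelines above.

    Heuristic only: intended for Darija Latin-script fields, and prone
    to false positives on legitimate spellings.
    """
    warnings = []
    if entry.endswith("e"):
        warnings.append("drop the final 'e' (louz, not louze)")
    if "_" in entry or "-" in entry:
        warnings.append("use spaces as word delimiters, not '_' or '-'")
    if "th" in entry:
        warnings.append("avoid 'th': standard Darija does not use ث/ذ/ظ")
    return warnings

print(guideline_warnings("louze"))      # one warning about the final 'e'
print(guideline_warnings("sou9_kbir"))  # one warning about the delimiter
print(guideline_warnings("louz"))       # []
```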

PyDODa - Python wrapper for the DODa

Python Badge

Pydoda is a comprehensive Python library that simplifies access to and analysis of the DODa dataset. It enables effortless exploration of linguistic content for researchers, developers, and language enthusiasts by providing an intuitive interface for accessing the various dataset categories and retrieving spellings and translations.

Integrating Pydoda into your Python workflow grants access to a wide range of functionalities, facilitating insights extraction from the DODa dataset, including semantic and syntactic analysis, translation retrieval, spelling variations exploration, and more.

Usage example

Pydoda can be installed with pip:

pip install pydoda

Here is a small code snippet:

from pydoda import Category

# Create an instance of Category
my_category = Category('semantic', 'animals')

# Get the Darija translation of a word
darija_translation = my_category.get_darija_translation('dog')
print(darija_translation)
# Output: klb

# Get the English translation of a word
english_translation = my_category.get_english_translation('mch')
print(english_translation)
# Output: cat

For further details, visit the official Pydoda GitHub repository & official Pydoda documentation.

Usage Terms

  • Research and Personal Use: You are welcome to use DODa for research, personal projects, and educational purposes, free of charge, in accordance with the terms of the open-source license.

  • Commercial Use: For commercial purposes or any other usage not covered by the open-source license, please contact the copyright holders Aissam or Hamza to discuss licensing options and permissions.

Citation

@misc{outchakoucht2024evolution,
      title={The Evolution of Darija Open Dataset: Introducing Version 2}, 
      author={Aissam Outchakoucht and Hamza Es-Samaali},
      year={2024},
      eprint={2405.13016},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{outchakoucht2021moroccan,
      title={Moroccan Dialect -Darija- Open Dataset},
      author={Aissam Outchakoucht and Hamza Es-Samaali},
      year={2021},
      eprint={2103.09687},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributors

ahkecha, ai-sam, aissam-out, anasselhoud, asmachkirida, boogey9, darija-enjoyer, darija-open-dataset, deepesh611, f-amine, femaleprog, haoes, imadsaddik, issamll, kinothe-kafkaesque, kuuhaakuu1, locutus2017, maghwa, moun3imy, saad-out, shinwi, teleyosh, zakarm, zouhair-isk


Issues

Hosting the dataset on DataStack?

Hello!

I enjoy seeing community data projects like this. As a former student of Cantonese (a dialect of Chinese spoken in Hong Kong), I know what it's like when the language you're interested in has no dictionary. At one point I tried to create one myself, but gave up after 500 words. It's too much effort to do alone.

The issue today is that collaborating on these things is very technical. I'm sure your contributors here are familiar with GitHub and have no problem helping out, but it would be great if anyone who just knows English and Darija could help out, without technical skills.

For this reason I am building DataStack. It's a collaboration platform for tabular data that works similarly to GitHub, but is much easier to use for data. No tools or technical knowledge needed.

To show you what it looks like, I've uploaded two files from your dataset there.

If you're interested, please have a look. Does this solution suit your needs? What is still missing, in your opinion?

typos in sentences.csv

I've been looking through sentences.csv and found some typos, such as "8" being used for "h" or "7". There are also sentences that could have variant translations, for example "kaychrab kass dyal lma" versus "kaychrb kass dlma". In addition, "x" is sometimes used instead of "ch".

sentences2.csv doesn't work (makhdamch)

I wanted to load this dataset into a dataframe, but I get the following error:

ParserError Traceback (most recent call last)
in
1 import pandas as pd
2
----> 3 print(pd.read_csv("dataset/sentences2.csv"))

~\anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
674 )
675
--> 676 return _read(filepath_or_buffer, kwds)
677
678 parser_f.name = name

~\anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
452
453 try:
--> 454 data = parser.read(nrows)
455 finally:
456 parser.close()

~\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
1131 def read(self, nrows=None):
1132 nrows = _validate_integer("nrows", nrows)
-> 1133 ret = self._engine.read(nrows)
1134
1135 # May alter columns / col_dict

~\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
2035 def read(self, nrows=None):
2036 try:
-> 2037 data = self._reader.read(nrows)
2038 except StopIteration:
2039 if self._first_chunk:

pandas_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 2 fields in line 1865, saw 4

And when I set error_bad_lines=False, I get this instead (see screenshot "Capture d’écran 2021-05-07 162904").
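The ParserError above is the comma-quoting issue from the guidelines: an unquoted comma inside a field makes the parser see extra columns. Note also that in recent pandas (≥ 1.3), `error_bad_lines` was deprecated in favor of `on_bad_lines`. A minimal reproduction and workaround (the real fix is to quote the offending field in the CSV itself):

```python
import io
import pandas as pd

# A malformed snippet like the one behind the ParserError: the third
# data line has an unquoted comma, so the parser sees 3 fields, not 2.
raw = (
    "darija,eng\n"
    "kayn chi mochkil,there is a problem\n"
    "chhal hada, 3afak,how much is this please\n"
    "la bas,fine\n"
)

# error_bad_lines=False is deprecated; pandas >= 1.3 uses on_bad_lines.
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(len(df))  # 2 rows survive; the malformed line is dropped
```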

Thank you !

Thanks so much for this initiative.
I will try to contribute during my spare time.
