mesolitica / malaysian-dataset
We gather Malaysian dataset! https://malaysian-dataset.readthedocs.io/
Home Page: https://malaysian-dataset.readthedocs.io/
License: Apache License 2.0
Hi, and thanks for all your effort on the dataset.
I'm trying to use the Twitter emotion dataset, but access is being denied.
Are you aware of this issue?
Dataset link
FastText model trained on the following labels:
lang_labels_v2 = {
0: 'standard-english',
1: 'local-english',
2: 'manglish',
3: 'standard-indonesian',
4: 'socialmedia-indonesian',
5: 'standard-malay',
6: 'local-malay',
7: 'standard-mandarin',
8: 'local-mandarin',
9: 'other',
}
Steps to reproduce the fasttext training at https://github.com/huseinzol05/malaya/blob/5.1/pretrained-model/language-detection-v2/train-fasttext-auto.ipynb
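A minimal sketch of querying the trained model; the checkpoint filename here is an assumption (use whatever the notebook above produces), and labels are assumed to be stored as __label__<name>:

import fasttext

# assumed checkpoint name from the training notebook
model = fasttext.load_model('fasttext-language-detection-v2.bin')
labels, probs = model.predict('Hai semua, apa khabar korang?', k=3)
print(labels, probs)  # top-3 labels, e.g. ('__label__local-malay', ...), with scores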
How did you preprocess the Wikipedia dumps to obtain the sentences in the .json files?
Hi huseinzol05,
Thanks for your GitHub repo, it's really helpful. I am trying to pretrain albert-base-v2 on my own custom chat logs (domain specific). I am training on 4 NVIDIA T4 GPUs with the same model configuration as albert-base, and with a batch size of 512 it takes around 20 minutes per 1000 batches (i.e. steps). May I ask what your training setup and training time were for your ALBERT/BERT pretraining? Thanks!
Regards
Xiaohong
Directory https://github.com/huseinzol05/malay-dataset/tree/master/crawl/harakah-daily
Need to dedup: list(set(texts))
Crawling
The dataset will consist of Malay, English, Singlish / Manglish, and local Mandarin.
Directory https://github.com/huseinzol05/malay-dataset/tree/master/crawl/utusan-borneo
Need to dedup: list(set(texts))
Directory https://github.com/huseinzol05/malay-dataset/tree/master/crawl/sinarharian
Need to dedup: list(set(texts))
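Note that list(set(texts)) does not preserve line order. A minimal order-preserving sketch (an alternative, not necessarily what the repo uses):

def dedup(texts):
    # dict keeps insertion order (Python 3.7+), so this keeps the first
    # occurrence of each line, unlike list(set(texts))
    return list(dict.fromkeys(texts))

print(dedup(['a', 'b', 'a', 'c']))  # ['a', 'b', 'c']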
Going to use https://malaya.readthedocs.io/en/stable/load-nsfw.html to filter NSFW content.
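As a rough sketch of wiring that in (malaya.nsfw.multinomial is my assumption from that page; check the linked docs for the exact loader in your malaya version):

import malaya

# assumed loader name; the load-nsfw docs page lists the exact API
model = malaya.nsfw.multinomial()
print(model.predict(['some crawled text to screen']))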
Hi Husein,
Apologies for cross-posting, I tried to reach you via email but it seems that you are more active here.
I came across a notebook on emotion analysis that employs your English dataset of tweets [1]. Unfortunately, the link is not working. So, I wonder if you still have a copy of this dataset. Also, I would be grateful if you could provide me with more information about the creation process of the English datasets of emotions [1] and sentiment [2]. If you have a paper that I can also refer to (cite), it would be great.
[1] https://github.com/huseinzol05/NLP-Dataset/emotion-english
[2] https://github.com/huseinzol05/NLP-Dataset/sentiment-english
Best Regards,
Omnia
Hi, thanks for the dataset. I'm looking at the SQuAD v2.0 dataset and I am a little confused.
The json files are not formatted the same way as in the official SQuAD v2.0 dataset. Would you by any chance have a copy that's formatted in the same way? (same dict structure, same dict keys, correct answer_start etc.)
If not, may I find out more about how the dataset is currently formatted? I see a lot of "<>" and "||".
Are the Malay parts presented like this:
Context <> Question || Answer || Answer <> Question || Answer || Answer || Answer <> Question || <> Question ||
So the <> separates context and QA examples, while the || separates the different parts within the QA example?
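If that reading is right, a minimal parser for one line of this layout could look like the following (the context-first field order is my assumption):

def parse_example(line):
    # assumed layout: Context <> Question || Answer || Answer <> Question || ...
    context, *qa_blocks = [part.strip() for part in line.split('<>')]
    qas = []
    for block in qa_blocks:
        question, *answers = [part.strip() for part in block.split('||')]
        qas.append({'question': question, 'answers': [a for a in answers if a]})
    return {'context': context, 'qas': qas}

print(parse_example('Ctx <> Q1 || A1 || A2 <> Q2 || A3'))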
May I know how this dataset was obtained? Can I assume the answer will definitely be found in the context?
For the English parts included in the dataset, am I correct to assume the answer_start cannot be used, and I have to manually process the Malay part to find the correct answer_start?
Thank you!
Hello,
Thanks for publishing the speech datasets. One of the links is dead: https://s3-ap-southeast-1.amazonaws.com/malaya-dataset/speech-bahasa.zip
It's referenced here: https://github.com/huseinzol05/Malay-Dataset/tree/master/speech/sebut-perkataan
Greetings :) I would like to use malay-text.txt
in my own project. Since this repository uses the Apache 2.0 license, I understand that the original license must be included in the folder where I use that file in my project. However, I see that your license is not filled in with a name and year (and possibly other items). Could you complete the license in this repository so I can properly use the file you built? Thank you!
Pipeline: rules-based normalization of the Malay text -> noisy-trained ms-en translation -> standard English -> en-ms translation back to standard Malay.
malaya.normalize
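A minimal sketch of that pipeline; the three stage functions are hypothetical stand-ins for the malaya.normalize rules and the two translation models:

def normalize_rules(text):
    # stand-in for rules-based normalization (e.g. malaya.normalize)
    return text

def translate_ms_en(text):
    # stand-in for the noisy-trained ms -> standard en model
    return text

def translate_en_ms(text):
    # stand-in for the standard en -> ms model
    return text

def standardize(text):
    return translate_en_ms(translate_ms_en(normalize_rules(text)))

print(standardize('ak nk g mkn jap'))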
Hi,
I would like to know if the tweet IDs for the various Twitter datasets can be made publicly available. I'm interested in pulling more metadata.
Thank you.
BPE tokenizer.
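A minimal sketch of training one with the Hugging Face tokenizers library; the corpus filename and vocab size below are assumptions:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# assumed corpus file and vocab size; adjust to the actual dataset
tokenizer = Tokenizer(BPE(unk_token='[UNK]'))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32000,
                     special_tokens=['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]'])
tokenizer.train(files=['dumping-texts.txt'], trainer=trainer)
tokenizer.save('bpe.tokenizer.json')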