mesolitica / malaysian-dataset
We gather Malaysian dataset! https://malaysian-dataset.readthedocs.io/
Home Page: https://malaysian-dataset.readthedocs.io/
License: Apache License 2.0
Hi, and thanks for all your effort on the dataset.
I'm trying to use the Twitter emotion dataset, but access is being denied.
Are you aware of this issue?
Dataset link
FastText model trained on the following labels:
lang_labels_v2 = {
0: 'standard-english',
1: 'local-english',
2: 'manglish',
3: 'standard-indonesian',
4: 'socialmedia-indonesian',
5: 'standard-malay',
6: 'local-malay',
7: 'standard-mandarin',
8: 'local-mandarin',
9: 'other',
}
Steps to reproduce the fasttext training at https://github.com/huseinzol05/malaya/blob/5.1/pretrained-model/language-detection-v2/train-fasttext-auto.ipynb
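A minimal sketch of querying the trained model; the checkpoint filename here is an assumption (use whatever the notebook above produces), and labels are assumed to be stored as __label__<name>:

import fasttext

# assumed checkpoint name from the training notebook
model = fasttext.load_model('fasttext-language-detection-v2.bin')
labels, probs = model.predict('Hai semua, apa khabar korang?', k=3)
print(labels, probs)  # top-3 labels, e.g. ('__label__local-malay', ...), with scores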
How did you preprocess the Wikipedia dumps to obtain the sentences in the .json files?
Hi huseinzol05,
Thanks for your GitHub repo, it's really helpful. I am trying to pretrain albert-base-v2 on my own custom chat logs (domain specific). I am training on 4 NVIDIA T4 GPUs with the same model configuration as albert-base, and with a batch size of 512 it takes around 20 minutes per 1000 batches (i.e. steps). May I ask what your training setup and training time were for your ALBERT/BERT pretraining? Thanks!
Regards
Xiaohong
Directory https://github.com/huseinzol05/malay-dataset/tree/master/crawl/harakah-daily
Need to dedup: list(set(texts))
Crawling
The dataset will consist of Malay, English, Singlish / Manglish, and local Mandarin.
Directory https://github.com/huseinzol05/malay-dataset/tree/master/crawl/utusan-borneo
Need to dedup: list(set(texts))
Directory https://github.com/huseinzol05/malay-dataset/tree/master/crawl/sinarharian
Need to dedup: list(set(texts))
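Note that list(set(texts)) does not preserve line order. A minimal order-preserving sketch (an alternative, not necessarily what the repo uses):

def dedup(texts):
    # dict keeps insertion order (Python 3.7+), so this keeps the first
    # occurrence of each line, unlike list(set(texts))
    return list(dict.fromkeys(texts))

print(dedup(['a', 'b', 'a', 'c']))  # ['a', 'b', 'c']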
Going to use https://malaya.readthedocs.io/en/stable/load-nsfw.html to filter NSFW content.
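As a rough sketch of wiring that in (malaya.nsfw.multinomial is my assumption from that page; check the linked docs for the exact loader in your malaya version):

import malaya

# assumed loader name; the load-nsfw docs page lists the exact API
model = malaya.nsfw.multinomial()
print(model.predict(['some crawled text to screen']))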
Hi Husein,
Apologies for cross-posting, I tried to reach you via email but it seems that you are more active here.
I came across a notebook on emotion analysis that employs your English dataset of tweets [1]. Unfortunately, the link is not working. So, I wonder if you still have a copy of this dataset. Also, I would be grateful if you could provide me with more information about the creation process of the English datasets of emotions [1] and sentiment [2]. If you have a paper that I can also refer to (cite), it would be great.
[1] https://github.com/huseinzol05/NLP-Dataset/emotion-english
[2] https://github.com/huseinzol05/NLP-Dataset/sentiment-english
Best Regards,
Omnia
Hi, thanks for the dataset. I'm looking at the SQuAD v2.0 dataset and I am a little confused.
The json files are not formatted the same way as in the official SQuAD v2.0 dataset. Would you by any chance have a copy that's formatted in the same way? (same dict structure, same dict keys, correct answer_start etc.)
If not, may I find out more about how the dataset is currently formatted? I see a lot of "<>" and "||".
Are the Malay parts presented like this:
Context <> Question || Answer || Answer <> Question || Answer || Answer || Answer <> Question || <> Question ||
So the <> separates context and QA examples, while the || separates the different parts within the QA example?
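If that reading is right, a minimal parser for one line of this layout could look like the following (the context-first field order is my assumption):

def parse_example(line):
    # assumed layout: Context <> Question || Answer || Answer <> Question || ...
    context, *qa_blocks = [part.strip() for part in line.split('<>')]
    qas = []
    for block in qa_blocks:
        question, *answers = [part.strip() for part in block.split('||')]
        qas.append({'question': question, 'answers': [a for a in answers if a]})
    return {'context': context, 'qas': qas}

print(parse_example('Ctx <> Q1 || A1 || A2 <> Q2 || A3'))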
May I know how this dataset was obtained? Can I assume the answer will definitely be found in the context?
For the English parts included in the dataset, am I correct to assume the answer_start cannot be used, and I have to manually process the Malay part to find the correct answer_start?
Thank you!
Hello,
Thanks for publishing the speech datasets. One of the links is dead: https://s3-ap-southeast-1.amazonaws.com/malaya-dataset/speech-bahasa.zip
It's referenced here: https://github.com/huseinzol05/Malay-Dataset/tree/master/speech/sebut-perkataan
Greetings :) I would like to use malay-text.txt
in my own project. Since this repository uses the Apache 2.0 license, I understand that the original license must be included in the folder where I use that file in my project. However, I see that your license is not filled in with a name and year (and possibly other items). Could you complete the license in this repository so I can properly use the file you built? Thank you!
Pipeline: rules-based normalization of the Malay text -> noisy-trained ms-en translation -> standard English -> en-ms translation back to standard Malay.
malaya.normalize
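A minimal sketch of that pipeline; the three stage functions are hypothetical stand-ins for the malaya.normalize rules and the two translation models:

def normalize_rules(text):
    # stand-in for rules-based normalization (e.g. malaya.normalize)
    return text

def translate_ms_en(text):
    # stand-in for the noisy-trained ms -> standard en model
    return text

def translate_en_ms(text):
    # stand-in for the standard en -> ms model
    return text

def standardize(text):
    return translate_en_ms(translate_ms_en(normalize_rules(text)))

print(standardize('ak nk g mkn jap'))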
Hi,
I would like to know if the tweet IDs for the various Twitter datasets can be made publicly available. I'm interested in pulling more metadata.
Thank you.
BPE tokenizer.
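A minimal sketch of training one with the Hugging Face tokenizers library; the corpus filename and vocab size below are assumptions:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# assumed corpus file and vocab size; adjust to the actual dataset
tokenizer = Tokenizer(BPE(unk_token='[UNK]'))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32000,
                     special_tokens=['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]'])
tokenizer.train(files=['dumping-texts.txt'], trainer=trainer)
tokenizer.save('bpe.tokenizer.json')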