malaysian-dataset's People

Contributors

aisyahrzk, ammar-azman, amzar96, atqnp, azrilhafizi, azwanzuharimi, brocapang, carrotzrule123, farhan-helmy, fattahharith, haizadtarik, haziqzikry, hazmannaim, hazqeel09, huseinzol05, izardy, kamaruladha, moiralah, oscarnz, radenmuaz, rulkimi, syafie-nzm, wanadzhar913

malaysian-dataset's Issues

training time inquiry

Hi huseinzol05,

Thanks for your GitHub repo; it is really helpful. I am trying to pretrain albert-base-v2 on my own domain-specific chat logs, using 4 NVIDIA T4 GPUs and the same model configuration as albert-base. With a batch size of 512, 1000 batches (i.e. steps) take around 20 minutes. May I ask what your training setup and training time were for pretraining your ALBERT/BERT models? Thanks!

Regards
Xiaohong
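
For comparing setups, the figures quoted in this issue can be turned into a throughput number. The sketch below is only arithmetic on those figures; it assumes 512 is the global batch size across all 4 GPUs (the issue does not say), and the 100,000-step target is a hypothetical value chosen for illustration.

```python
def samples_per_second(steps: int, batch_size: int, minutes: float) -> float:
    """Training throughput in samples per second."""
    return steps * batch_size / (minutes * 60)


def hours_for_steps(total_steps: int, steps_measured: int, minutes_measured: float) -> float:
    """Extrapolate wall-clock hours, assuming the step time stays constant."""
    return total_steps / steps_measured * minutes_measured / 60


# Figures from the issue: 1000 steps of batch size 512 in about 20 minutes.
throughput = samples_per_second(steps=1000, batch_size=512, minutes=20)

# Hypothetical target of 100,000 steps (not stated in the issue).
estimated_hours = hours_for_steps(100_000, steps_measured=1000, minutes_measured=20)

print(f"{throughput:.1f} samples/s, about {estimated_hours:.1f} h for 100k steps")
```

If 512 is instead the per-GPU batch size, the throughput would be four times higher.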

Download and prepare

The dataset will consist of Malay, English, Singlish / Manglish and local Mandarin.

Old NLP Dataset of English Emotions

Hi Husein,

Apologies for cross-posting, I tried to reach you via email but it seems that you are more active here.

I came across a notebook on emotion analysis that employs your English dataset of tweets [1]. Unfortunately, the link is not working. So, I wonder if you still have a copy of this dataset. Also, I would be grateful if you could provide me with more information about the creation process of the English datasets of emotions [1] and sentiment [2]. If you have a paper that I can also refer to (cite), it would be great.

[1] https://github.com/huseinzol05/NLP-Dataset/emotion-english
[2] https://github.com/huseinzol05/NLP-Dataset/sentiment-english

Best Regards,
Omnia

Translated SQuAD v2.0 format

Hi, thanks for the dataset. I'm looking at the SQuAD v2.0 dataset and I am a little confused.

The json files are not formatted the same way as in the official SQuAD v2.0 dataset. Would you by any chance have a copy that's formatted in the same way? (same dict structure, same dict keys, correct answer_start etc.)

If not, may I find out more about how the dataset is currently formatted? I see a lot of "<>" and "| |".
Are the Malay parts presented like this:
Context <> Question || Answer || Answer <> Question || Answer || Answer || Answer <> Question || <> Question ||

So the <> separates context and QA examples, while the || separates the different parts within the QA example?

May I know how this dataset was obtained? Can I assume the answer will definitely be found in the context?

For the English parts included in the dataset, am I correct to assume the answer_start cannot be used, and I have to manually process the Malay part to find the correct answer_start?

Thank you!
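
If the layout guessed in this issue is right ("<>" between the context and each QA example, "||" between a question and its answers), a record could be converted to a SQuAD-like dictionary roughly as follows. This is a minimal sketch under that assumption only: the format has not been confirmed by the maintainer, `parse_record` is a hypothetical helper, and the sample line is made up for illustration. `answer_start` is recomputed by searching for the answer in the Malay context and is -1 when the translated answer does not appear verbatim.

```python
from typing import Any, Dict, List


def parse_record(line: str) -> Dict[str, Any]:
    """Parse 'Context <> Question || Answer || ... <> Question || ...'.

    Assumed layout (a guess from the issue, not a documented format).
    """
    parts = [p.strip() for p in line.split("<>")]
    context, qa_chunks = parts[0], parts[1:]

    qas: List[Dict[str, Any]] = []
    for chunk in qa_chunks:
        if not chunk:
            continue  # skip empty pieces produced by a trailing "<>"
        fields = [f.strip() for f in chunk.split("||")]
        question = fields[0]
        answers = [
            # str.find returns -1 if the answer text is not in the context
            {"text": text, "answer_start": context.find(text)}
            for text in fields[1:]
            if text
        ]
        qas.append(
            {
                "question": question,
                "answers": answers,
                "is_impossible": not answers,  # no answers: unanswerable question
            }
        )
    return {"context": context, "qas": qas}


# Made-up example line, for illustration only.
sample = (
    "Kuala Lumpur ialah ibu negara Malaysia. "
    "<> Apakah ibu negara Malaysia? || Kuala Lumpur || Kuala Lumpur "
    "<> Siapakah presiden Malaysia? ||"
)
record = parse_record(sample)
print(record)
```

Answers whose `answer_start` comes back as -1 would need manual alignment or fuzzy matching, since the translation may not preserve the exact answer span.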

License copyright name and year

Salam :) I would like to use malay-text.txt in my own project. Because this repository uses the Apache 2.0 license, I understand that the original license must also be placed in the folder of my project where I use the file. However, I see that your license is missing the name and year (and possibly other items). Could you fill in the license in this repository so that I can use the file you have built correctly? Thank you!

preparing abstractive normalization

Rule-based normalization of Malay text -> noisy ms-en translation -> standard English -> en-ms translation.

  1. Rule-based normalization of Malay text, from malaya.normalize.
  2. Noisy ms-en model; Google Translate is good enough for this step.
  3. en-ms translation; based on our test set, the en-ms translation from malaya is more acceptable for us.
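
The chain of steps above could be wired together as a sequence of text-to-text functions, as in the sketch below. It only illustrates the ordering of the stages: `build_pipeline`, `make_pair`, and the three stage functions are hypothetical placeholders, and no real malaya or Google Translate call is made or implied. Each placeholder would be replaced by the actual normalizer or translator.

```python
from typing import Callable, Iterable, Tuple

Stage = Callable[[str], str]


def build_pipeline(stages: Iterable[Stage]) -> Stage:
    """Compose stages so that each stage's output feeds the next one."""
    stage_list = list(stages)

    def run(text: str) -> str:
        for stage in stage_list:
            text = stage(text)
        return text

    return run


def make_pair(noisy: str, pipeline: Stage) -> Tuple[str, str]:
    """Return a (noisy input, normalized output) training pair."""
    return noisy, pipeline(noisy)


# Hypothetical placeholders; swap in the real implementations.
def rules_normalize(text: str) -> str:
    return " ".join(text.split())  # stand-in: only collapses whitespace


def translate_ms_en(text: str) -> str:
    return text  # stand-in for the noisy ms-en translation step


def translate_en_ms(text: str) -> str:
    return text  # stand-in for the en-ms translation step


pipeline = build_pipeline([rules_normalize, translate_ms_en, translate_en_ms])
print(make_pair("sy   xnak  pergi", pipeline))
```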

Wikipedia Jawi (1.5 GB)
