

Traditional-Chinese Alpaca

This repo is for research purposes only, and the work presented is still at an early stage of development. The results are far from perfect, and generation quality varies significantly.


This repo shares resources for building Traditional-Chinese instruction-following language models. It contains:

  • A Traditional-Chinese version of the Alpaca dataset with English alignment. See the Dataset section for details. Our very simple alignment technique could work for other languages as well.
  • Code for training and running inference with the Traditional-Chinese Alpaca-LoRA.

Below are some good examples generated by our 7B Traditional-Chinese Alpaca-LoRA.

(Four example screenshots omitted.)

Dataset

We translated the Stanford Alpaca 52k dataset directly into Traditional Chinese via the ChatGPT API (gpt-3.5-turbo), which cost roughly 40 USD.
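For reference, a per-field translation call might look like the sketch below, which uses the pre-1.0 openai Python package; the system prompt wording here is illustrative, not necessarily the exact prompt we used.

import os
import openai  # pip install "openai<1.0"

openai.api_key = os.environ["OPENAI_API_KEY"]

def translate_to_traditional_chinese(text: str) -> str:
    # Translate a single Alpaca field via gpt-3.5-turbo.
    # temperature=0 keeps the output as deterministic as possible.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Translate the user's text into Traditional Chinese "
                        "(Taiwan usage). Return only the translation."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"].strip()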

Specifically, this repo includes three datasets:

  1. A Traditional-Chinese version of the Alpaca dataset. --> alpaca-tw.json
  2. The same dataset as 1., except the instructions are left in English. --> alpaca-tw_en_instruction.json
  3. An aligned dataset, which simply combines 1. and 2. --> alpaca-tw_en-align.json

In our preliminary experiments, fine-tuning with only the Traditional-Chinese dataset (i.e., dataset 1.) does not yield ideal results (e.g., degeneration, poor understanding). As LLaMA is trained primarily on English corpora, its ability to understand other languages may require further alignment.

To this end, we create a Traditional-Chinese version of the Alpaca dataset with English alignment (i.e., dataset 3.), where, besides the instruction-following task, the model implicitly learns Chinese-English translation. The examples above were produced by training with this aligned dataset.
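Since the alignment is plain concatenation, the aligned file can be reproduced in a few lines; a minimal sketch, assuming the standard Alpaca instruction/input/output schema:

import json

# Combine the fully translated set (dataset 1.) with the
# English-instruction set (dataset 2.) to form the aligned set (dataset 3.).
with open("alpaca-tw.json", encoding="utf-8") as f:
    tw = json.load(f)
with open("alpaca-tw_en_instruction.json", encoding="utf-8") as f:
    tw_en_instruction = json.load(f)

aligned = tw + tw_en_instruction  # 52k + 52k examples
with open("alpaca-tw_en-align.json", "w", encoding="utf-8") as f:
    json.dump(aligned, f, ensure_ascii=False, indent=2)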

We hypothesize that for languages (e.g., Spanish, Portuguese) that share subword vocabulary with English, simply fine-tuning with the translated Alpaca dataset would already yield strong performance.

Training

The code for training the Traditional-Chinese Alpaca-LoRA is available here. It is based largely on Alpaca-LoRA and Cabrita. Our training is done on a single RTX 3090.
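The core of an Alpaca-LoRA-style setup looks roughly like the sketch below; the checkpoint name and hyperparameters shown are Alpaca-LoRA's defaults and are assumptions here, so see the linked training code for the authoritative values.

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# Load LLaMA-7B in 8-bit so fine-tuning fits on a single RTX 3090 (24 GB).
# The checkpoint name is illustrative.
BASE_MODEL = "decapoda-research/llama-7b-hf"
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = prepare_model_for_int8_training(model)

# Alpaca-LoRA's default adapter config: low-rank updates on the attention
# query/value projections only, leaving well under 1% of weights trainable.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Training then proceeds with a standard transformers Trainer over the
# aligned dataset, formatted with the Alpaca prompt template.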

Inference

The code for running inference with the trained model is available here.
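Loading the trained adapter for generation follows the usual PEFT pattern; a minimal sketch, where the paths, prompt template, and sampling parameters are illustrative rather than the repo's exact settings:

import torch
from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

BASE_MODEL = "decapoda-research/llama-7b-hf"  # illustrative checkpoint
LORA_WEIGHTS = "./lora-alpaca-tw"             # illustrative adapter path

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, LORA_WEIGHTS)  # attach the LoRA adapter
model.eval()

# Simplified Alpaca-style prompt; the exact template is in the repo's code.
prompt = "### Instruction:\n請推薦三個台北的景點。\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        generation_config=GenerationConfig(
            do_sample=True, temperature=0.7, top_p=0.9
        ),
        max_new_tokens=256,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))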

Next

  1. Fine-tune various multilingual foundation models (e.g., bloomz-7b1).
  2. Construct a large-scale Traditional-Chinese instruction-following dataset.
  3. Construct domain-specific Traditional-Chinese instruction-following datasets.

Please feel free to reach out (contact[at]nlg.csie.ntu.edu.tw) if you are interested in any form of collaboration!

Reference

A large portion of our work relies on or is motivated by LLaMA, Stanford Alpaca, Alpaca-LoRA, ChatGPT, Hugging Face, and Cabrita. We thank these incredible individuals, groups, and communities for open-sourcing their amazing work!

Citation

If you use the data or code from this repo, please cite it as follows:

@misc{traditional-chinese-alpaca,
  author = {Wei-Lin Chen and Cheng-Kuang Wu and Hsin-Hsi Chen},
  title = {Traditional-Chinese Alpaca: Models and Datasets},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ntunlplab/traditional-chinese-alpaca}},
}


Issues

Request for Model Weights and Environment Version Numbers

Hello,
I am currently using your GitHub repository and encountered an issue when trying to run the inference.py script. I received the following error message:

ValueError: Can't find config.json at '../model/7b-tw_plus_en_ins-6_epoch'

It seems that the model weights and/or the config.json file are missing from the repository. I kindly request that you provide the necessary model weights and the config.json file, so I can continue with my testing.

Additionally, could you also provide the specific environment version numbers for the dependencies used in your project? This will help ensure compatibility and reduce potential issues when setting up the environment.

Thank you in advance for your assistance and prompt response.

Simplified-Chinese Terms in the Dataset (使用簡體中文用語)

Some of the sentences in alpaca-tw-en-align.json use Simplified-Chinese terms:

文本 --> 文字
字符 --> 字串
博客 --> 部落格

Please correct these terms.
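A naive cleanup pass could apply plain string replacement over every field, as in the sketch below; blind substitution can over-correct, so the resulting diff would still need manual review.

import json

# Map the Simplified-Chinese terms reported above to the suggested
# Traditional-Chinese replacements. This is crude: review before committing.
TERM_MAP = {"文本": "文字", "字符": "字串", "博客": "部落格"}

with open("alpaca-tw_en-align.json", encoding="utf-8") as f:
    data = json.load(f)

for example in data:
    for field in ("instruction", "input", "output"):
        for simplified, traditional in TERM_MAP.items():
            example[field] = example[field].replace(simplified, traditional)

with open("alpaca-tw_en-align.cleaned.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)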
