

Traditional-Chinese Alpaca

This repo is for research purposes only, and the work presented is still at an early stage of development. The results are far from perfect, and generation quality varies significantly.


This repo shares resources for building Traditional-Chinese instruction-following language models. It contains:

  • A Traditional-Chinese version of the Alpaca dataset with English alignment. See the Dataset section for details. Our very simple alignment technique could work for other languages as well.
  • Code for training and running inference with the Traditional-Chinese Alpaca-LoRA.

Below are some good examples generated by our 7B Traditional-Chinese Alpaca-LoRA.

(Four example screenshots omitted.)

Dataset

We translated the Stanford Alpaca 52k dataset directly into Traditional Chinese via the ChatGPT API (gpt-3.5-turbo), which cost roughly 40 USD.
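For reference, a per-field translation call might look like the sketch below, which uses the pre-1.0 openai Python package; the system prompt wording here is illustrative, not necessarily the exact prompt we used.

import os
import openai  # pip install "openai<1.0"

openai.api_key = os.environ["OPENAI_API_KEY"]

def translate_to_traditional_chinese(text: str) -> str:
    # Translate a single Alpaca field via gpt-3.5-turbo.
    # temperature=0 keeps the output as deterministic as possible.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Translate the user's text into Traditional Chinese "
                        "(Taiwan usage). Return only the translation."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"].strip()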

Specifically, this repo includes three datasets:

  1. A Traditional-Chinese version of the Alpaca dataset. --> alpaca-tw.json
  2. The same dataset as 1., except the instructions are left in English. --> alpaca-tw_en_instruction.json
  3. An aligned dataset, which simply combines 1. and 2. --> alpaca-tw_en-align.json

In our preliminary experiments, fine-tuning with only the Traditional-Chinese dataset (i.e., dataset 1.) does not yield ideal results (e.g., degeneration, poor understanding). As LLaMA is trained primarily on English corpora, its ability to understand other languages may require further alignment.

To this end, we create a Traditional-Chinese version of the Alpaca dataset with English alignment (i.e., dataset 3.), where, besides the instruction-following task, the model implicitly learns Chinese-English translation. The examples above were produced by training with this aligned dataset.
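Since the alignment is plain concatenation, the aligned file can be reproduced in a few lines; a minimal sketch, assuming the standard Alpaca instruction/input/output schema:

import json

# Combine the fully translated set (dataset 1.) with the
# English-instruction set (dataset 2.) to form the aligned set (dataset 3.).
with open("alpaca-tw.json", encoding="utf-8") as f:
    tw = json.load(f)
with open("alpaca-tw_en_instruction.json", encoding="utf-8") as f:
    tw_en_instruction = json.load(f)

aligned = tw + tw_en_instruction  # 52k + 52k examples
with open("alpaca-tw_en-align.json", "w", encoding="utf-8") as f:
    json.dump(aligned, f, ensure_ascii=False, indent=2)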

We hypothesize that for languages (e.g., Spanish, Portuguese) that share subword vocabulary with English, simply fine-tuning with the translated Alpaca dataset would already yield strong performance.

Training

The code for training the Traditional-Chinese Alpaca-LoRA is available here. It is based largely on Alpaca-LoRA and Cabrita. Our training is done on a single RTX 3090.
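The core of an Alpaca-LoRA-style setup looks roughly like the sketch below; the checkpoint name and hyperparameters shown are Alpaca-LoRA's defaults and are assumptions here, so see the linked training code for the authoritative values.

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# Load LLaMA-7B in 8-bit so fine-tuning fits on a single RTX 3090 (24 GB).
# The checkpoint name is illustrative.
BASE_MODEL = "decapoda-research/llama-7b-hf"
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = prepare_model_for_int8_training(model)

# Alpaca-LoRA's default adapter config: low-rank updates on the attention
# query/value projections only, leaving well under 1% of weights trainable.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Training then proceeds with a standard transformers Trainer over the
# aligned dataset, formatted with the Alpaca prompt template.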

Inference

The code for running inference with the trained model is available here.
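Loading the trained adapter for generation follows the usual PEFT pattern; a minimal sketch, where the paths, prompt template, and sampling parameters are illustrative rather than the repo's exact settings:

import torch
from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

BASE_MODEL = "decapoda-research/llama-7b-hf"  # illustrative checkpoint
LORA_WEIGHTS = "./lora-alpaca-tw"             # illustrative adapter path

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, LORA_WEIGHTS)  # attach the LoRA adapter
model.eval()

# Simplified Alpaca-style prompt; the exact template is in the repo's code.
prompt = "### Instruction:\n請推薦三個台北的景點。\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        generation_config=GenerationConfig(
            do_sample=True, temperature=0.7, top_p=0.9
        ),
        max_new_tokens=256,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))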

Next

  1. Fine-tune various multilingual foundation models (e.g., bloomz-7b1).
  2. Construct a large-scale Traditional-Chinese instruction-following dataset.
  3. Construct domain-specific Traditional-Chinese instruction-following datasets.

Please feel free to reach out (contact[at]nlg.csie.ntu.edu.tw) if you are interested in any form of collaboration!

Reference

A large portion of our work relies on or is motivated by LLaMA, Stanford Alpaca, Alpaca-LoRA, ChatGPT, Hugging Face, and Cabrita. We thank these incredible individuals, groups, and communities for open-sourcing their amazing work!

Citation

If you use the data or code from this repo, please cite it as follows:

@misc{traditional-chinese-alpaca,
  author = {Wei-Lin Chen and Cheng-Kuang Wu and Hsin-Hsi Chen},
  title = {Traditional-Chinese Alpaca: Models and Datasets},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ntunlplab/traditional-chinese-alpaca}},
}


Issues

Request for Model Weights and Environment Version Numbers

Hello,
I am currently using your GitHub repository and encountered an issue when trying to run the inference.py script. I received the following error message:

ValueError: Can't find config.json at '../model/7b-tw_plus_en_ins-6_epoch'

It seems that the model weights and/or the config.json file are missing from the repository. I kindly request that you provide the necessary model weights and the config.json file, so I can continue with my testing.

Additionally, could you also provide the specific environment version numbers for the dependencies used in your project? This will help ensure compatibility and reduce potential issues when setting up the environment.

Thank you in advance for your assistance and prompt response.

Simplified-Chinese Terms in the Dataset (使用簡體中文用語)

Some of the sentences in alpaca-tw-en-align.json use Simplified-Chinese terms:

文本 --> 文字
字符 --> 字串
博客 --> 部落格

Please correct these terms.
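A naive cleanup pass could apply plain string replacement over every field, as in the sketch below; blind substitution can over-correct, so the resulting diff would still need manual review.

import json

# Map the Simplified-Chinese terms reported above to the suggested
# Traditional-Chinese replacements. This is crude: review before committing.
TERM_MAP = {"文本": "文字", "字符": "字串", "博客": "部落格"}

with open("alpaca-tw_en-align.json", encoding="utf-8") as f:
    data = json.load(f)

for example in data:
    for field in ("instruction", "input", "output"):
        for simplified, traditional in TERM_MAP.items():
            example[field] = example[field].replace(simplified, traditional)

with open("alpaca-tw_en-align.cleaned.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)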
