CMDAG: A Chinese Metaphor Dataset with Annotated Grounds

This dataset provides a carefully annotated and consistently formatted collection of Chinese metaphors (~28k entries). It has passed expert quality checks and is general-purpose in character, which makes it especially suitable for applications such as pre-training large-scale models.

💹 Statistics

  • 📏 Average sentence length: 59.137
  • 🚀 Number of tenor-vehicle pairs: 36002
  • 📄 Total contexts: 27989

✈️ Large Language Models (LLM) Experiment Progress

Progress status of various experiments with different models:

  • OpenAI Models🌐
    • ChatGPT 3.5 Turbo🤖
    • ChatGPT 4.0 🤖
  • ChatGLM🤓
  • Baichuan🌊
  • Belle🛎️
  • Baidu ERNIE🦅
  • TigerBot🐅
    • TigerBot-7B-Chat🐾
    • TigerBot-13B-Chat🐾
  • Chinese-alpaca🦙
    • Chinese-alpaca-7B🏮
    • Chinese-alpaca-13B🏮
    • Chinese-alpaca-33B🏮
  • Chinese-alpaca-2🦙
    • Chinese-alpaca-2-7B🎏
    • Chinese-alpaca-2-13B🎏
  • chinese-llama2-linly🦙
    • chinese-llama2-linly-7B🍃
    • chinese-llama2-linly-13B🍃
  • Qwen-7B-Chat🌟
  • Ziya-LLaMA-13B🌌

💼 Model Selection Criteria:

  • 🌏 Supports Chinese
  • 🔄 Uses the latest version as much as possible
  • 🧪 Can perform CoT and Prompt Engineering

📜 Human evaluation progress

  • OpenAI Models🌐
    • ChatGPT 3.5 Turbo🤖
    • ChatGPT 4.0 🤖
  • ChatGLM🤓
  • Baichuan🌊
  • Belle🛎️
  • Baidu ERNIE🦅
  • TigerBot🐅
    • TigerBot-7B-Chat🐾
    • TigerBot-13B-Chat🐾
  • Chinese-alpaca🦙
    • Chinese-alpaca-7B🏮
    • Chinese-alpaca-13B🏮
    • Chinese-alpaca-33B🏮
  • Chinese-alpaca-2🦙
    • Chinese-alpaca-2-7B🎏
    • Chinese-alpaca-2-13B🎏
  • chinese-llama2-linly🦙
    • chinese-llama2-linly-7B🍃
    • chinese-llama2-linly-13B🍃
  • Qwen-7B-Chat🌟
  • Ziya-LLaMA-13B🌌

💼 Model Selection Criteria:

  • 🌏 Supports Chinese
  • 🔄 Embraces the latest versions
  • 🧪 Capable of CoT and Prompt Engineering tasks

📖 Experiment Details

🎯 Purpose of the Experiment

The main aim is to show that this curated dataset, enriched with Ground annotations, can significantly assist large-scale models in metaphor generation and vehicle generation tasks. Every model is evaluated under the same six standardized settings.

🛠️ Experimental Setup

📊 Variables:

  • 📌 Clustering methods: There are two. One clusters on the [CLS] token embedding (sentence-level clustering); the other clusters on individual word embeddings (word-level clustering). They correspond to the functions cluster_and_save (referred to as Cluster 1) and cluster_and_save_word_embeddings (referred to as Cluster 2) in utils\bert_cluster.py.

  • 🎩 Task types: There are two main tasks: metaphor generation based on Ground CoT and metaphor generation based on Vehicle CoT.

The two tasks are specified as follows:

  1. Ground-based CoT:

    • Experimental group: Starting with samples from the existing clusters, the first step generates the shared Ground from the tenor and vehicle. The second step generates a metaphor from this derived Ground together with the existing tenor and vehicle.
    • Control group: Without CoT, the metaphor is generated directly from the existing tenor and vehicle.
  2. Vehicle-based CoT:

    • Experimental group: The first step generates the vehicle from the tenor and Ground sampled from the existing clusters. The second step generates a metaphor from the derived vehicle together with the existing tenor and Ground.
    • Control group: Without CoT, the metaphor is generated directly from the tenor and Ground.
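
The two-step flow of each experimental group, and the single-step flow of each control group, can be sketched as below. This is a minimal illustration and not the project's code: the prompt templates, function names, and the `ask` callable (a stand-in for whatever chat-model call is used) are all hypothetical, the templates are in English only for readability, and the way cluster samples enter the prompt is not modeled.

```python
from typing import Callable, Dict

# Hypothetical templates for illustration only; the project's prompts are in Chinese.
GROUND_PROMPT = "Tenor: {tenor}\nVehicle: {vehicle}\nName the Ground they share."
VEHICLE_PROMPT = "Tenor: {tenor}\nGround: {ground}\nPropose a fitting vehicle."
METAPHOR_PROMPT = (
    "Tenor: {tenor}\nVehicle: {vehicle}\nGround: {ground}\n"
    "Write one metaphorical sentence."
)
NO_GROUND_PROMPT = "Tenor: {tenor}\nVehicle: {vehicle}\nWrite one metaphorical sentence."
NO_VEHICLE_PROMPT = "Tenor: {tenor}\nGround: {ground}\nWrite one metaphorical sentence."

Ask = Callable[[str], str]  # assumed interface: prompt in, model reply out


def ground_cot(tenor: str, vehicle: str, ask: Ask) -> Dict[str, str]:
    """Ground-based CoT, experimental group: derive the Ground, then the metaphor."""
    ground = ask(GROUND_PROMPT.format(tenor=tenor, vehicle=vehicle))
    metaphor = ask(METAPHOR_PROMPT.format(tenor=tenor, vehicle=vehicle, ground=ground))
    return {"ground": ground, "metaphor": metaphor}


def ground_control(tenor: str, vehicle: str, ask: Ask) -> Dict[str, str]:
    """Ground-based CoT, control group: one call, no intermediate Ground."""
    return {"metaphor": ask(NO_GROUND_PROMPT.format(tenor=tenor, vehicle=vehicle))}


def vehicle_cot(tenor: str, ground: str, ask: Ask) -> Dict[str, str]:
    """Vehicle-based CoT, experimental group: derive the vehicle, then the metaphor."""
    vehicle = ask(VEHICLE_PROMPT.format(tenor=tenor, ground=ground))
    metaphor = ask(METAPHOR_PROMPT.format(tenor=tenor, vehicle=vehicle, ground=ground))
    return {"vehicle": vehicle, "metaphor": metaphor}


def vehicle_control(tenor: str, ground: str, ask: Ask) -> Dict[str, str]:
    """Vehicle-based CoT, control group: one call, no intermediate vehicle."""
    return {"metaphor": ask(NO_VEHICLE_PROMPT.format(tenor=tenor, ground=ground))}
```

Passing the model call in as `ask` keeps the chain itself independent of any particular API, so the same four functions could in principle be reused across the models listed above.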

🗄️ Sample Clusters: A small set of examples (20 entries), each fully annotated with tenor, vehicle, Ground, and metaphor. They were sampled from the training set of the original dataset using the two clustering methods described above, and are stored under Data\Selected Data.

🔎 Data for Experiments: The split data for the dataset resides under Data\Splited Data. The directory contains:

  • train_data: The first 80% of data sorted by ID, used as the primary source for LLM experiments.
  • val_data: The subsequent 10% after train_data, sorted by ID.
  • test_data: The following 10% after val_data, sorted by ID.
  • 📂 train_data_sampled_200.json: This is a cleansed and selectively sampled subset from train_data, chosen specifically for the large model experiments.
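
An ID-ordered 80/10/10 split like the one described above could be reproduced roughly as follows. This is a sketch under assumptions: the record field name `id`, the function name, and the rounding rule (integer floor, with any remainder going to the test slice) are not taken from the repository.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def split_by_id(
    records: List[Record], train_pct: int = 80, val_pct: int = 10
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Sort by ID, then cut into consecutive train / val / test slices."""
    ordered = sorted(records, key=lambda r: r["id"])  # "id" is an assumed field name
    train_end = len(ordered) * train_pct // 100       # integer math avoids float rounding
    val_end = train_end + len(ordered) * val_pct // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]
```

Because the slices are consecutive after sorting, the split is deterministic and needs no random seed.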

📜 Procedure:

  • 🖥️ Code Samples: Sample code for running the experiments on OpenAI GPT-4 is under Code\OpenAI Sample Code. The CoT prompt must be written in Chinese. Prompts may be adjusted based on model outputs, but must keep the same information content.
  • 📉 Sample Results: The results of the experiments with completed models are stored in the Experiment_Results/Model_Name directory. Each model's results are represented as six JSON files. They capture the prompts, outputs, and other data for the model across the six settings.
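
One way to read a model's results back in is to load every JSON file in its directory and check that all six settings are present. The sketch below assumes only what is stated above (one directory per model, six JSON files); the function names are hypothetical, and nothing is assumed about the file names or the keys inside each file.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_model_results(model_dir: str) -> Dict[str, Any]:
    """Load every JSON file in one model's result directory, keyed by file stem."""
    results: Dict[str, Any] = {}
    for path in sorted(Path(model_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:  # UTF-8 for Chinese text
            results[path.stem] = json.load(handle)
    return results


def has_all_settings(results: Dict[str, Any], expected: int = 6) -> bool:
    """True when the directory held one file per experimental setting."""
    return len(results) == expected
```

For example, `load_model_results("Experiment_Results/Model_Name")` would return a dictionary with one entry per setting, provided the directory follows the layout described above.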

🌅 Visualization of Dataset Statistics

Various word clouds depicting the frequency and variety of terms used in the dataset:

  • Tenors word cloud
  • Tenors (English) word cloud
  • Vehicles word cloud
  • Vehicles (English) word cloud
  • Grounds (Noun) word cloud
  • Grounds (Noun) (English) word cloud
  • Grounds (Adjective) word cloud
  • Grounds (Adjective) (English) word cloud

🖋️ Reference

If you utilize the code, data, or any models from this project, please cite the project as follows:

@inproceedings{cn_metaphor_yujie,
    title     = {CMDAG: A Chinese Metaphor Dataset with Annotated Grounds as CoT for Boosting Metaphor Generation},
    author    = {Yujie Shao and Xinrong Yao and Xingwei Qu and Chenghua Lin and Shi Wang and Stephen W. Huang and Ge Zhang and Jie Fu},
    year      = {2024},
    booktitle = {Proceedings of the LREC-COLING 2024},
    url       = {https://anonymous.4open.science/r/Chinese_Metaphor_Explanation-63F2},
}
