
moa's Introduction

Mixture-of-Agents (MoA)

MoA architecture

Overview · Quickstart · Advanced example · Interactive CLI Demo · Evaluation · Results · Credits

Overview

Mixture of Agents (MoA) is a novel approach that leverages the collective strengths of multiple LLMs to enhance performance, achieving state-of-the-art results. By employing a layered architecture where each layer comprises several LLM agents, MoA significantly outperforms GPT-4 Omni’s 57.5% on AlpacaEval 2.0 with a score of 65.1%, using only open-source models!

Quickstart: MoA in 50 LOC

To get started with using MoA in your own apps, see moa.py. In this simple example, we'll use 2 layers and 4 LLMs. You'll need to:

  1. Install the Together Python library: pip install together
  2. Get your Together API Key & export it: export TOGETHER_API_KEY=
  3. Run the python file: python moa.py
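The core pattern looks roughly like the sketch below. This is a minimal illustration rather than the contents of moa.py, and the model IDs are examples; substitute any chat models available on Together.

# Minimal 2-layer MoA sketch: 4 reference models answer, 1 aggregator synthesizes.
# Requires `pip install together` and TOGETHER_API_KEY in the environment.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

reference_models = [  # example model IDs; any Together chat models will do
    "Qwen/Qwen2-72B-Instruct",
    "mistralai/Mixtral-8x22B-Instruct-v0.1",
    "meta-llama/Llama-3-70b-chat-hf",
    "databricks/dbrx-instruct",
]
aggregator_model = "Qwen/Qwen2-72B-Instruct"
aggregator_system = (
    "You have been provided with a set of responses from various models to the "
    "latest user query. Synthesize these responses into a single, high-quality response."
)
user_prompt = "What are 3 fun things to do in SF?"

# Layer 1: query each reference model independently.
reference_answers = []
for model in reference_models:
    result = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_prompt}],
    )
    reference_answers.append(result.choices[0].message.content)

# Layer 2: the aggregator synthesizes the reference answers into one response.
numbered = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(reference_answers))
final = client.chat.completions.create(
    model=aggregator_model,
    messages=[
        {"role": "system", "content": aggregator_system + "\n\nResponses from models:\n" + numbered},
        {"role": "user", "content": user_prompt},
    ],
)
print(final.choices[0].message.content)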

MoA explained

Multi-layer MoA Example

In the previous example, we went over how to implement MoA with 2 layers (4 LLMs answering and 1 LLM aggregating). However, one strength of MoA is that responses can pass through several layers to produce an even better final answer. In this example, we'll go through how to run MoA with 3+ layers in advanced-moa.py.

python advanced-moa.py

MoA – 3 layer example
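Conceptually, adding layers just repeats the answer-then-aggregate step: every layer after the first shows each reference model the previous layer's answers before it responds, and the aggregator produces the final answer at the end. The sketch below illustrates this; it is not the contents of advanced-moa.py, and it reuses the client, model lists, user prompt, and aggregator system prompt from the quickstart sketch above.

def inject_references(answers):
    # Build a system prompt that contains the previous layer's numbered answers.
    numbered = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(answers))
    return aggregator_system + "\n\nResponses from models:\n" + numbered

def query(model, system=None):
    # Single chat call; the optional system prompt carries the prior layer's answers.
    messages = ([{"role": "system", "content": system}] if system else []) + [
        {"role": "user", "content": user_prompt}
    ]
    result = client.chat.completions.create(model=model, messages=messages)
    return result.choices[0].message.content

num_layers = 3
answers = [query(m) for m in reference_models]                 # layer 1: independent answers
for _ in range(num_layers - 2):                                # intermediate layers
    layer_system = inject_references(answers)
    answers = [query(m, layer_system) for m in reference_models]
final_answer = query(aggregator_model, inject_references(answers))  # final aggregation layer
print(final_answer)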

Interactive CLI Demo

This interactive CLI demo showcases a simple multi-turn chatbot where the final response is aggregated from various reference models.

To run the interactive demo, follow these 3 steps:

  1. Export Your API Key: export TOGETHER_API_KEY={your_key}
  2. Install Requirements: pip install -r requirements.txt
  3. Run the script: python bot.py

The CLI will prompt you to input instructions interactively:

  1. Start by entering your instruction at the ">>>" prompt.
  2. The system will process your input using the predefined reference models.
  3. It will generate a response based on the aggregated outputs from these models.
  4. You can continue the conversation by inputting more instructions, with the system maintaining the context of the multi-turn interaction.

[Optional] Additional Configuration

The demo will prompt you for certain options, but if you want additional control, you can specify these parameters:

  • --aggregator: The primary model used for final response generation.
  • --reference_models: List of models used as references.
  • --temperature: Controls the randomness of the response generation.
  • --max_tokens: Maximum number of tokens in the response.
  • --rounds: Number of rounds to process the input for refinement (the number of rounds equals the number of MoA layers minus 1).
  • --num_proc: Number of processes to run in parallel for faster execution.
  • --multi_turn: Boolean to toggle multi-turn interaction capability.
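For example, a run with a custom aggregator and sampling settings might look like the following (the model ID is illustrative, and the exact argument syntax, especially for list-valued flags like --reference_models, may differ; check the script's help output):

python bot.py --aggregator Qwen/Qwen2-72B-Instruct --temperature 0.7 --max_tokens 512 --rounds 1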

Evaluation

We provide scripts to quickly reproduce some of the results presented in our paper. For convenience, we have included the code from AlpacaEval, MT-Bench, and FLASK, with necessary modifications. We extend our gratitude to these projects for creating the benchmarks.

Preparation

# install requirements
pip install -r requirements.txt

# install the bundled AlpacaEval copy
cd alpaca_eval
pip install -e .
cd ..

# install the bundled FastChat copy (used for MT-Bench)
cd FastChat
pip install -e ".[model_worker,llm_judge]"
cd ..

# setup api keys
export TOGETHER_API_KEY=<TOGETHER_API_KEY>
export OPENAI_API_KEY=<OPENAI_API_KEY>

Run AlpacaEval 2

To run AlpacaEval 2, execute the following scripts:

bash run_eval_alpaca_eval.sh

Run MT-Bench

For a minimal example of MT-Bench evaluation, run:

bash run_eval_mt_bench.sh

Run FLASK

For a minimal example of FLASK evaluation, run:

bash run_eval_flask.sh

Results

AlpacaEval 2.0 and MT-Bench results

We achieved top positions on both the AlpacaEval 2.0 leaderboard and MT-Bench. Notably, on AlpacaEval 2.0, using solely open-source models, we achieved an absolute improvement of 7.6%, from 57.5% (GPT-4 Omni) to 65.1% (MoA).

FLASK evaluation results

FLASK offers fine-grained evaluation of models across multiple dimensions. Our MoA method significantly outperforms the original Qwen1.5-110B-Chat on harmlessness, robustness, correctness, efficiency, factuality, commonsense, insightfulness, and completeness. MoA also outperforms GPT-4 Omni in terms of correctness, factuality, insightfulness, completeness, and metacognition.

Please feel free to contact us if you have difficulties in reproducing the results.

Credits

This work was made possible by the collaborative spirit and contributions of active organizations in the AI field. We appreciate the efforts of Meta AI, Mistral AI, Microsoft, Alibaba Cloud, and Databricks for developing the Llama 3, Mixtral, WizardLM 2, Qwen 1.5, and DBRX models. Additionally, we extend our gratitude to Tatsu Labs, LMSYS, and KAIST AI for developing the AlpacaEval, MT-Bench, and FLASK evaluation benchmarks.

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Citation

If you find this work helpful, please consider citing:

@article{wang2024mixture,
  title={Mixture-of-Agents Enhances Large Language Model Capabilities},
  author={Wang, Junlin and Wang, Jue and Athiwaratkun, Ben and Zhang, Ce and Zou, James},
  journal={arXiv preprint arXiv:2406.04692},
  year={2024}
}

moa's People

Contributors

bcipolli · eltociear · isthatyou · lorrinwww · nutlope


moa's Issues

Run locally?

I was just wondering if I can run this locally?

Evaluation on Objective Benchmarks

I think this work is meaningful and provides remarkable results. However, I find that all the test benchmarks are subjective benchmarks whose outputs are judged by LLMs. Have you tried using MoA for objective tasks such as MMLU or MATH? I think this could make MoA even more valuable. Thanks!

--rounds 2 seems broken

The first round of LLMs works fine, but then the input to the second round of LLMs becomes:

"You have been provided with a set of responses from various open-source models to the latest user query. Your task is to synthesize these responses into a single, high-quality response. It is crucial to critically evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply to the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of accuracy and reliability.

Responses from models:
1. P
2. l
3. e
4. a
5. s
6. e
7.  
8. t
9. e
10. l
11. l
12.  
13. m
14. e
15.  
16. m
17. o
18. r
19. e
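This symptom (the reference list enumerating single characters that spell out "Please tell me more") is what Python produces when a single response string is enumerated where a list of responses was expected. A minimal, hypothetical illustration of the likely pattern, not the actual bot.py code:

responses = "Please tell me more"  # a single string where a list of responses was expected

# Enumerating a string yields its characters, producing the numbered list above:
print("\n".join(f"{i + 1}. {r}" for i, r in enumerate(responses)))
# 1. P
# 2. l
# 3. e
# ...

# Wrapping single responses in a list restores the intended behavior:
responses = ["Please tell me more"]
print("\n".join(f"{i + 1}. {r}" for i, r in enumerate(responses)))
# 1. Please tell me more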

Agents?

Just curious, I found the title of the approach pretty confusing. "Mixture of Agents" to me implies some agentic behaviour -- LLMs using tools and decision making to accomplish a task.

However, if I'm correct, there's no agentic behaviour, tool use, or decision making here. Instead, the proposed method is more of a "Mixture of LLMs", where LLM calls are combined and iterated on.

Is my understanding correct?

Performance on Large Language Models with Fewer Parameters

hi, nice work~

I'm quite interested in the performance of this approach on large language models with fewer parameters, for example using three 7B models as proposers and a 14B model as the aggregator, and so on. Have you made any attempts in this direction?

add a critique model to the mix to improve the answers

Enhanced Workflow with Scoring Mechanism for Model-Only Process

Scoring Criteria

The criteria for evaluation will remain the same:

  • Clarity: How clear and understandable the response is.
  • Relevance: How relevant the response is to the prompt.
  • Accuracy: How factually correct the response is.
  • Completeness: How thoroughly the response addresses the prompt.
  • Coherence: How logically consistent the response is.

Each criterion is scored from 0 to 10, and the overall score is the average of these scores. The minimum passing score will be set at 7.

Workflow Steps

Input Prompt:
  The initial input is fed into the first layer.

Layer 1:
  Three agents A_{1,1}, A_{1,2}, and A_{1,3} process the input independently.
  Intermediate outputs are generated and concatenated.

Critique 1 with Scoring:
  A critique agent evaluates the concatenated output using the criteria (Clarity, Relevance, Accuracy, Completeness, Coherence).
  Each criterion is scored from 0 to 10; the overall score is the average of the criteria scores.
  If the overall score is >= 7, the output is passed to Layer 2.
  If the overall score is < 7, the output is sent back to Layer 1 for revision by the agents.

Layer 2:
  The adjusted output from Critique 1 is processed by agents A_{2,1}, A_{2,2}, and A_{2,3}.
  Intermediate outputs are generated and concatenated.

Critique 2 with Scoring:
  A critique agent evaluates the outputs from Layer 2 using the same criteria.
  Outputs are scored and averaged.
  If the overall score is >= 7, the output is passed to Layer 3.
  If the overall score is < 7, the output is sent back to Layer 2 for revision by the agents.

Layer 3:
  The adjusted output from Critique 2 is processed by agents A_{3,1}, A_{3,2}, and A_{3,3}.
  Intermediate outputs are generated and concatenated.

Critique 3 with Scoring:
  A final critique agent evaluates the outputs from Layer 3.
  Outputs are scored and averaged.
  If the overall score is >= 7, the output is passed to Layer 4.
  If the overall score is < 7, the output is sent back to Layer 3 for revision by the agents.

Layer 4:
  The final adjusted output is processed by agent A_{4,1}.
  The Final Output is produced.

Final Output:
  The output from Layer 4 is the final output, having passed all critique evaluations and scoring criteria.

Diagram Summary:
Input Prompt -> Layer 1 -> Critique 1 with Scoring -> (Pass if score >= 7 or Revise if score < 7) -> Layer 2 -> Critique 2 with Scoring -> (Pass or Revise) -> Layer 3 -> Critique 3 with Scoring -> (Pass or Revise) -> Layer 4 -> Final Output
Example Diagram Description:

Input Prompt: Initial input is fed into Layer 1.

Layer 1: Agents A_{1,1}, A_{1,2}, and A_{1,3} process the input independently, generating intermediate outputs which are concatenated.

Critique 1 with Scoring: A critique agent evaluates the concatenated output, scoring it on clarity, relevance, accuracy, completeness, and coherence. If the score is >= 7, the output passes to Layer 2; otherwise, it is sent back to Layer 1.

Layer 2: Agents A_{2,1}, A_{2,2}, and A_{2,3} process the adjusted output, generating new intermediate outputs which are concatenated.

Critique 2 with Scoring: The critique agent evaluates the new outputs, scoring them as before. Outputs scoring >= 7 pass to Layer 3; others are sent back to Layer 2.

Layer 3: Agents A_{3,1}, A_{3,2}, and A_{3,3} process the further adjusted output, generating final intermediate outputs which are concatenated.

Critique 3 with Scoring: The final critique agent evaluates and scores the outputs. Outputs scoring >= 7 pass to Layer 4; others are sent back to Layer 3.

Layer 4: The final agent A_{4,1} processes the output to produce the final answer.

This workflow ensures each layer's output meets a quality threshold before advancing, thereby enhancing the final output's overall quality.
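For concreteness, here is a minimal, hypothetical sketch of the proposed critique gate. It is not part of this repository; the prompt format, function names, and the assumption that the critique model replies with five comma-separated numbers are all illustrative choices.

# Hypothetical sketch of the proposed critique-with-scoring gate (not repo code).
from together import Together

client = Together()  # requires TOGETHER_API_KEY in the environment

CRITERIA = ["Clarity", "Relevance", "Accuracy", "Completeness", "Coherence"]
PASSING_SCORE = 7.0

def generate(model, prompt):
    """Single-turn helper around Together's chat completions endpoint."""
    result = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return result.choices[0].message.content

def critique_score(critic_model, prompt, candidate):
    """Ask the critique model for five 0-10 scores and return their average."""
    scoring_prompt = (
        f"Score the following response on {', '.join(CRITERIA)}, each from 0 to 10. "
        "Reply with five comma-separated numbers only.\n\n"
        f"Prompt: {prompt}\n\nResponse: {candidate}"
    )
    reply = generate(critic_model, scoring_prompt)  # assumes the model follows the format
    scores = [float(s) for s in reply.split(",")[: len(CRITERIA)]]
    return sum(scores) / len(scores)

def layer_with_gate(agent_models, critic_model, prompt, max_revisions=3):
    """Run one layer's agents, then loop until the critique score passes."""
    for _ in range(max_revisions):
        outputs = [generate(m, prompt) for m in agent_models]
        combined = "\n\n".join(outputs)
        if critique_score(critic_model, prompt, combined) >= PASSING_SCORE:
            break
        # Below threshold: feed the combined draft back to the agents for revision.
        prompt = f"{prompt}\n\nPrevious draft to improve:\n{combined}"
    return combined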

seems that MoA does not work on MATH and QA with both weak and strong LLMs

I have thoroughly tested MoA (with one layer) on some objective benchmarks (less subjective compared to MT-Bench), such as GSM8K and HotpotQA.
It seems that when the LLMs are 7B-level, it does not work anymore.
In my setting, the three LLMs in layer one are mistralai/Mistral-7B-Instruct-v0.1/2/3, while the aggregator is meta-llama/Meta-Llama-3.1-8B-Instruct.
(Before the experiment, I tested each model's ability to solve the problems; the most capable one is Llama-3.1-8B.)

Then, when applying MoA, I find that performance decreases. For example, on GSM8K, accuracy drops from 75.1 to 61.3: Llama-3.1 alone achieves 75.1 (rounds=0), while 61.3 comes from rounds=1, where the intermediate layer consists of Mistral-7B v0.1/2/3.

This finding also applies to HotpotQA.

Has anyone observed something similar? Any suggestions on how to use 7B-level LLMs?

questions about the intermediate layers

Hello, I cannot follow the mechanism of the intermediate layers from the paper.
It is easy to understand the first layer, where each LLM takes the same prompt and generates a response separately, but how does it work in the following layers?
For example, in the second layer, does each LLM take the original prompt plus the concatenation of the first layer's outputs? If so, what is the output of each LLM in the second layer?

Thanks.
