
moa's Introduction

Mixture-of-Agents (MoA)

MoA architecture

Overview · Quickstart · Advanced example · Interactive CLI Demo · Evaluation · Results · Credits

Overview

Mixture of Agents (MoA) is a novel approach that leverages the collective strengths of multiple LLMs to enhance performance, achieving state-of-the-art results. By employing a layered architecture where each layer comprises several LLM agents, MoA significantly outperforms GPT-4 Omni’s 57.5% on AlpacaEval 2.0 with a score of 65.1%, using only open-source models!

Quickstart: MoA in 50 LOC

To get started with using MoA in your own apps, see moa.py. In this simple example, we'll use 2 layers and 4 LLMs. You'll need to:

  1. Install the Together Python library: pip install together
  2. Get your Together API Key & export it: export TOGETHER_API_KEY=
  3. Run the python file: python moa.py
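The core pattern looks roughly like the sketch below. This is a minimal illustration rather than the contents of moa.py, and the model IDs are examples; substitute any chat models available on Together.

# Minimal 2-layer MoA sketch: 4 reference models answer, 1 aggregator synthesizes.
# Requires `pip install together` and TOGETHER_API_KEY in the environment.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

reference_models = [  # example model IDs; any Together chat models will do
    "Qwen/Qwen2-72B-Instruct",
    "mistralai/Mixtral-8x22B-Instruct-v0.1",
    "meta-llama/Llama-3-70b-chat-hf",
    "databricks/dbrx-instruct",
]
aggregator_model = "Qwen/Qwen2-72B-Instruct"
aggregator_system = (
    "You have been provided with a set of responses from various models to the "
    "latest user query. Synthesize these responses into a single, high-quality response."
)
user_prompt = "What are 3 fun things to do in SF?"

# Layer 1: query each reference model independently.
reference_answers = []
for model in reference_models:
    result = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_prompt}],
    )
    reference_answers.append(result.choices[0].message.content)

# Layer 2: the aggregator synthesizes the reference answers into one response.
numbered = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(reference_answers))
final = client.chat.completions.create(
    model=aggregator_model,
    messages=[
        {"role": "system", "content": aggregator_system + "\n\nResponses from models:\n" + numbered},
        {"role": "user", "content": user_prompt},
    ],
)
print(final.choices[0].message.content)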

MoA explained

Multi-layer MoA Example

In the previous example, we went over how to implement MoA with 2 layers (4 LLMs answering and 1 LLM aggregating). However, one strength of MoA is that responses can pass through several layers to produce an even better final answer. In this example, we'll go through how to run MoA with 3+ layers in advanced-moa.py.

python advanced-moa.py

MoA – 3 layer example
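Conceptually, adding layers just repeats the answer-then-aggregate step: every layer after the first shows each reference model the previous layer's answers before it responds, and the aggregator produces the final answer at the end. The sketch below illustrates this; it is not the contents of advanced-moa.py, and it reuses the client, model lists, user prompt, and aggregator system prompt from the quickstart sketch above.

def inject_references(answers):
    # Build a system prompt that contains the previous layer's numbered answers.
    numbered = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(answers))
    return aggregator_system + "\n\nResponses from models:\n" + numbered

def query(model, system=None):
    # Single chat call; the optional system prompt carries the prior layer's answers.
    messages = ([{"role": "system", "content": system}] if system else []) + [
        {"role": "user", "content": user_prompt}
    ]
    result = client.chat.completions.create(model=model, messages=messages)
    return result.choices[0].message.content

num_layers = 3
answers = [query(m) for m in reference_models]                 # layer 1: independent answers
for _ in range(num_layers - 2):                                # intermediate layers
    layer_system = inject_references(answers)
    answers = [query(m, layer_system) for m in reference_models]
final_answer = query(aggregator_model, inject_references(answers))  # final aggregation layer
print(final_answer)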

Interactive CLI Demo

This interactive CLI demo showcases a simple multi-turn chatbot where the final response is aggregated from various reference models.

To run the interactive demo, follow these 3 steps:

  1. Export Your API Key: export TOGETHER_API_KEY={your_key}
  2. Install Requirements: pip install -r requirements.txt
  3. Run the script: python bot.py

The CLI will prompt you to input instructions interactively:

  1. Start by entering your instruction at the ">>>" prompt.
  2. The system will process your input using the predefined reference models.
  3. It will generate a response based on the aggregated outputs from these models.
  4. You can continue the conversation by inputting more instructions, with the system maintaining the context of the multi-turn interaction.

[Optional] Additional Configuration

The demo will prompt you for certain options, but if you want additional control, you can specify these parameters:

  • --aggregator: The primary model used for final response generation.
  • --reference_models: List of models used as references.
  • --temperature: Controls the randomness of the response generation.
  • --max_tokens: Maximum number of tokens in the response.
  • --rounds: Number of rounds to process the input for refinement (the number of rounds equals the number of MoA layers minus 1).
  • --num_proc: Number of processes to run in parallel for faster execution.
  • --multi_turn: Boolean to toggle multi-turn interaction capability.
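For example, a run with a custom aggregator and sampling settings might look like the following (the model ID is illustrative, and the exact argument syntax, especially for list-valued flags like --reference_models, may differ; check the script's help output):

python bot.py --aggregator Qwen/Qwen2-72B-Instruct --temperature 0.7 --max_tokens 512 --rounds 1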

Evaluation

We provide scripts to quickly reproduce some of the results presented in our paper. For convenience, we have included the code from AlpacaEval, MT-Bench, and FLASK, with necessary modifications. We extend our gratitude to these projects for creating the benchmarks.

Preparation

# install requirements
pip install -r requirements.txt

# install the bundled AlpacaEval copy
cd alpaca_eval
pip install -e .
cd ..

# install the bundled FastChat copy (used for MT-Bench)
cd FastChat
pip install -e ".[model_worker,llm_judge]"
cd ..

# setup api keys
export TOGETHER_API_KEY=<TOGETHER_API_KEY>
export OPENAI_API_KEY=<OPENAI_API_KEY>

Run AlpacaEval 2

To run AlpacaEval 2, execute the following scripts:

bash run_eval_alpaca_eval.sh

Run MT-Bench

For a minimal example of MT-Bench evaluation, run:

bash run_eval_mt_bench.sh

Run FLASK

For a minimal example of FLASK evaluation, run:

bash run_eval_flask.sh

Results

AlpacaEval 2.0 and MT-Bench results

We achieved top positions on both the AlpacaEval 2.0 leaderboard and MT-Bench. Notably, on AlpacaEval 2.0, using solely open-source models, we achieved an absolute improvement of 7.6%, from 57.5% (GPT-4 Omni) to 65.1% (MoA).

FLASK evaluation results

FLASK offers fine-grained evaluation of models across multiple dimensions. Our MoA method significantly outperforms the original Qwen1.5-110B-Chat on harmlessness, robustness, correctness, efficiency, factuality, commonsense, insightfulness, and completeness. MoA also outperforms GPT-4 Omni in terms of correctness, factuality, insightfulness, completeness, and metacognition.

Please feel free to contact us if you have difficulties in reproducing the results.

Credits

This work was made possible by the collaborative spirit and contributions of active organizations in the AI field. We appreciate the efforts of Meta AI, Mistral AI, Microsoft, Alibaba Cloud, and Databricks for developing the Llama 3, Mixtral, WizardLM 2, Qwen 1.5, and DBRX models. Additionally, we extend our gratitude to Tatsu Labs, LMSYS, and KAIST AI for developing the AlpacaEval, MT-Bench, and FLASK evaluation benchmarks.

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

Citation

If you find this work helpful, please consider citing:

@article{wang2024mixture,
  title={Mixture-of-Agents Enhances Large Language Model Capabilities},
  author={Wang, Junlin and Wang, Jue and Athiwaratkun, Ben and Zhang, Ce and Zou, James},
  journal={arXiv preprint arXiv:2406.04692},
  year={2024}
}

moa's People

Contributors

bcipolli · eltociear · isthatyou · lorrinwww · nutlope


moa's Issues

Run locally?

I was just wondering if I can run this locally?

Evaluation on Objective Benchmarks

I think this work is meaningful and provides remarkable results. However, I find that all the test benchmarks are subjective benchmarks whose outputs are judged by LLMs. Have you tried using MoA for objective tasks such as MMLU or MATH? I think this could make MoA even more valuable. Thanks!

--rounds 2 seems broken

The first round of LLMs works fine, but then the input to the second round of LLMs becomes:

"You have been provided with a set of responses from various open-source models to the latest user query. Your task is to synthesize these responses into a single, high-quality response. It is crucial to critically evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply to the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of accuracy and reliability.

Responses from models:
1. P
2. l
3. e
4. a
5. s
6. e
7.  
8. t
9. e
10. l
11. l
12.  
13. m
14. e
15.  
16. m
17. o
18. r
19. e
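This symptom (the reference list enumerating single characters that spell out "Please tell me more") is what Python produces when a single response string is enumerated where a list of responses was expected. A minimal, hypothetical illustration of the likely pattern, not the actual bot.py code:

responses = "Please tell me more"  # a single string where a list of responses was expected

# Enumerating a string yields its characters, producing the numbered list above:
print("\n".join(f"{i + 1}. {r}" for i, r in enumerate(responses)))
# 1. P
# 2. l
# 3. e
# ...

# Wrapping single responses in a list restores the intended behavior:
responses = ["Please tell me more"]
print("\n".join(f"{i + 1}. {r}" for i, r in enumerate(responses)))
# 1. Please tell me more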

Agents?

Just curious, I found the title of the approach pretty confusing. "Mixture of Agents" to me implies some agentic behaviour -- LLMs using tools and decision making to accomplish a task.

However, if I'm correct, there's no agentic behaviour, tool use, or decision making here. Instead, the proposed method is more of a "Mixture of LLMs", where LLM calls are combined and iterated on.

Is my understanding correct?

Performance on Large Language Models with Fewer Parameters

hi, nice work~

I'm quite interested in the performance of this approach on large language models with fewer parameters, for example using three 7B models as proposers and a 14B model as the aggregator, and so on. Have you made any attempts in this direction?

add a critique model to the mix to improve the answers

Enhanced Workflow with Scoring Mechanism for Model-Only Process

Scoring Criteria

The criteria for evaluation will remain the same:

  • Clarity: How clear and understandable the response is.
  • Relevance: How relevant the response is to the prompt.
  • Accuracy: How factually correct the response is.
  • Completeness: How thoroughly the response addresses the prompt.
  • Coherence: How logically consistent the response is.

Each criterion is scored from 0 to 10, and the overall score is the average of these scores. The minimum passing score will be set at 7.

Workflow Steps

Input Prompt:
  The initial input is fed into the first layer.

Layer 1:
  Three agents A_{1,1}, A_{1,2}, and A_{1,3} process the input independently.
  Intermediate outputs are generated and concatenated.

Critique 1 with Scoring:
  A critique agent evaluates the concatenated output using the criteria (Clarity, Relevance, Accuracy, Completeness, Coherence).
  Each criterion is scored from 0 to 10; the overall score is the average of the criteria scores.
  If the overall score is >= 7, the output is passed to Layer 2.
  If the overall score is < 7, the output is sent back to Layer 1 for revision by the agents.

Layer 2:
  The adjusted output from Critique 1 is processed by agents A_{2,1}, A_{2,2}, and A_{2,3}.
  Intermediate outputs are generated and concatenated.

Critique 2 with Scoring:
  A critique agent evaluates the outputs from Layer 2 using the same criteria.
  Outputs are scored and averaged.
  If the overall score is >= 7, the output is passed to Layer 3.
  If the overall score is < 7, the output is sent back to Layer 2 for revision by the agents.

Layer 3:
  The adjusted output from Critique 2 is processed by agents A_{3,1}, A_{3,2}, and A_{3,3}.
  Intermediate outputs are generated and concatenated.

Critique 3 with Scoring:
  A final critique agent evaluates the outputs from Layer 3.
  Outputs are scored and averaged.
  If the overall score is >= 7, the output is passed to Layer 4.
  If the overall score is < 7, the output is sent back to Layer 3 for revision by the agents.

Layer 4:
  The final adjusted output is processed by agent A_{4,1}.
  The Final Output is produced.

Final Output:
  The output from Layer 4 is the final output, having passed all critique evaluations and scoring criteria.

Diagram Summary:
Input Prompt -> Layer 1 -> Critique 1 with Scoring -> (Pass if score >= 7 or Revise if score < 7) -> Layer 2 -> Critique 2 with Scoring -> (Pass or Revise) -> Layer 3 -> Critique 3 with Scoring -> (Pass or Revise) -> Layer 4 -> Final Output
Example Diagram Description:

Input Prompt: Initial input is fed into Layer 1.

Layer 1: Agents A_{1,1}, A_{1,2}, and A_{1,3} process the input independently, generating intermediate outputs which are concatenated.

Critique 1 with Scoring: A critique agent evaluates the concatenated output, scoring it on clarity, relevance, accuracy, completeness, and coherence. If the score is >= 7, the output passes to Layer 2; otherwise, it is sent back to Layer 1.

Layer 2: Agents A_{2,1}, A_{2,2}, and A_{2,3} process the adjusted output, generating new intermediate outputs which are concatenated.

Critique 2 with Scoring: The critique agent evaluates the new outputs, scoring them as before. Outputs scoring >= 7 pass to Layer 3; others are sent back to Layer 2.

Layer 3: Agents A_{3,1}, A_{3,2}, and A_{3,3} process the further adjusted output, generating final intermediate outputs which are concatenated.

Critique 3 with Scoring: The final critique agent evaluates and scores the outputs. Outputs scoring >= 7 pass to Layer 4; others are sent back to Layer 3.

Layer 4: The final agent A_{4,1} processes the output to produce the final answer.

This workflow ensures each layer's output meets a quality threshold before advancing, thereby enhancing the final output's overall quality.
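For concreteness, here is a minimal, hypothetical sketch of the proposed critique gate. It is not part of this repository; the prompt format, function names, and the assumption that the critique model replies with five comma-separated numbers are all illustrative choices.

# Hypothetical sketch of the proposed critique-with-scoring gate (not repo code).
from together import Together

client = Together()  # requires TOGETHER_API_KEY in the environment

CRITERIA = ["Clarity", "Relevance", "Accuracy", "Completeness", "Coherence"]
PASSING_SCORE = 7.0

def generate(model, prompt):
    """Single-turn helper around Together's chat completions endpoint."""
    result = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return result.choices[0].message.content

def critique_score(critic_model, prompt, candidate):
    """Ask the critique model for five 0-10 scores and return their average."""
    scoring_prompt = (
        f"Score the following response on {', '.join(CRITERIA)}, each from 0 to 10. "
        "Reply with five comma-separated numbers only.\n\n"
        f"Prompt: {prompt}\n\nResponse: {candidate}"
    )
    reply = generate(critic_model, scoring_prompt)  # assumes the model follows the format
    scores = [float(s) for s in reply.split(",")[: len(CRITERIA)]]
    return sum(scores) / len(scores)

def layer_with_gate(agent_models, critic_model, prompt, max_revisions=3):
    """Run one layer's agents, then loop until the critique score passes."""
    for _ in range(max_revisions):
        outputs = [generate(m, prompt) for m in agent_models]
        combined = "\n\n".join(outputs)
        if critique_score(critic_model, prompt, combined) >= PASSING_SCORE:
            break
        # Below threshold: feed the combined draft back to the agents for revision.
        prompt = f"{prompt}\n\nPrevious draft to improve:\n{combined}"
    return combined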

seems that MoA does not work on MATH and QA with both weak and strong LLMs

I have thoroughly tested MoA (with one layer) on some objective benchmarks (less subjective compared to MT-Bench), such as GSM8K and HotpotQA.
It seems that when the LLMs are 7B-level, it does not work anymore.
In my setting, the three LLMs in layer one are mistralai/Mistral-7B-Instruct-v0.1/2/3, while the aggregator is meta-llama/Meta-Llama-3.1-8B-Instruct.
(Before the experiment, I tested each model's ability to solve the problems; the most capable one is Llama-3.1-8B.)

Then, when applying MoA, I find that performance decreases. For example, on GSM8K, accuracy drops from 75.1 to 61.3: Llama-3.1 alone achieves 75.1 (rounds=0), while 61.3 comes from rounds=1, where the intermediate layer consists of Mistral-7B v0.1/2/3.

This finding also applies to HotpotQA.

Has anyone observed something similar? Any suggestions on how to use 7B-level LLMs?

questions about the intermediate layers

Hello, I cannot follow the mechanism of the intermediate layers from the paper.
It is easy to understand the first layer, where each LLM takes the same prompt and generates a response separately, but how does it work in the following layers?
For example, in the second layer, does each LLM take the original prompt plus the concatenation of the first layer's outputs? If so, what is the output of each LLM in the second layer?

Thanks.
