
tora's Introduction

ToRA: A Tool-Integrated Reasoning Agent



[🌐 Website][📜 Paper][🤗 HF Models][🐱 GitHub]
[🐦 Twitter][💬 Reddit][🍀 Unofficial Blog]

Repo for "ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving" [ICLR'2024]


Figure 1: Comparing ToRA with baselines on LLaMA-2 base models from 7B to 70B.

🔥 News

  • [2023/10/08] 🔥🔥🔥 All ToRA models released at 🤗 HuggingFace!
  • [2023/09/29] ToRA paper, repo, and website released.

💡 Introduction

ToRA is a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical reasoning problems by interacting with tools, e.g., computation libraries and symbolic solvers. The ToRA series seamlessly integrates natural language reasoning with the use of external tools, combining the analytical strengths of language with the computational efficiency of tools.

| Model | Size | GSM8k | MATH | AVG@10 math tasks |
|---|---|---|---|---|
| GPT-4 | - | 92.0 | 42.5 | 78.3 |
| GPT-4 (PAL) | - | 94.2 | 51.8 | 86.4 |
| ToRA-7B | 7B | 68.8 | 40.1 | 62.4 |
| ToRA-Code-7B | 7B | 72.6 | 44.6 | 66.5 |
| ToRA-Code-7B + self-consistency (k=50) | 7B | 76.8 | 52.5 | - |
| ToRA-13B | 13B | 72.7 | 43.0 | 65.9 |
| ToRA-Code-13B | 13B | 75.8 | 48.1 | 71.3 |
| ToRA-Code-13B + self-consistency (k=50) | 13B | 80.4 | 55.1 | - |
| ToRA-Code-34B* | 34B | 80.7 | 51.0 | 74.8 |
| ToRA-Code-34B + self-consistency (k=50) | 34B | 85.1 | 60.0 | - |
| ToRA-70B | 70B | 84.3 | 49.7 | 76.9 |
| ToRA-70B + self-consistency (k=50) | 70B | 88.3 | 56.9 | - |
  • *ToRA-Code-34B is currently the first and only open-source model to achieve over 50% accuracy (pass@1) on the MATH dataset. It significantly outperforms GPT-4's CoT result (51.0 vs. 42.5) and is competitive with GPT-4 solving problems with programs (PAL). By open-sourcing our code and models, we hope more breakthroughs will come!

  • 10 math tasks include GSM8k, MATH, GSM-Hard, SVAMP, TabMWP, ASDiv, SingleEQ, SingleOP, AddSub, and MultiArith.
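
The "+ self-consistency (k=50)" rows are obtained by majority voting the final answer over k sampled solutions per problem; the snippet below is a minimal illustration of the idea, not the evaluation code in this repo:

from collections import Counter

# Minimal illustration of self-consistency: majority-vote the final answers
# of k sampled solutions (illustrative only; not this repo's evaluation code).
def majority_vote(answers):
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(["1024", "1024", "512", "1024"]))  # -> "1024"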

Tool-Integrated Reasoning


Figure 2: A basic example of single-round tool interaction, which interleaves rationales with program-based tool use.
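
To make the interaction pattern concrete, the sketch below walks through one round: the model writes a rationale followed by a program, the program is executed, and the output is appended so the model can finish its reasoning. The generate function and the exact delimiters are illustrative assumptions based on Figure 2, not this repo's inference code.

import io
import contextlib

def run_program(code: str) -> str:
    """Execute a generated Python snippet and capture its stdout.
    NOTE: no sandboxing here; a real pipeline should isolate execution."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def tool_integrated_round(problem: str, generate) -> str:
    # 1) The model writes a rationale followed by a ```python ... ``` program.
    draft = generate(problem, stop="```output")
    code = draft.split("```python")[-1].split("```")[0]
    # 2) The program is run by the external tool (here, a Python interpreter).
    result = run_program(code)
    # 3) The tool output is appended so the model can continue its reasoning.
    return generate(problem + draft + "```output\n" + result + "\n```\n")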

ToRA Training Pipeline


Figure 3: ToRA training consists of ① imitation learning and ② output space shaping.
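
As a rough sketch of step ②, output space shaping samples extra trajectories from the fine-tuned model, keeps the valid ones, and lets a teacher model correct invalid ones before another round of training. The helpers below (sample, is_valid, teacher_correct) are hypothetical; see the paper for the exact procedure.

# Rough sketch of output space shaping (step 2). The helpers `sample`,
# `is_valid`, and `teacher_correct` are hypothetical, not repo functions.
def shape_output_space(model, teacher_correct, is_valid, problems, k=4):
    extra_data = []
    for prob in problems:
        for traj in model.sample(prob.question, n=k):  # diverse sampled trajectories
            if is_valid(traj, prob.answer):            # keep correct trajectories
                extra_data.append((prob, traj))
            else:                                      # let a teacher fix wrong ones
                fixed = teacher_correct(prob, traj)
                if fixed is not None and is_valid(fixed, prob.answer):
                    extra_data.append((prob, fixed))
    return extra_data  # merged with the corpus for another training round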

🚀 Quick Start

⚙️ Setup

We recommend using Conda to manage your environment. We use vLLM (0.1.4) to accelerate inference. Run the following commands to set up your environment:

git clone https://github.com/microsoft/ToRA.git && cd ToRA/src
conda create -n tora python=3.10
conda activate tora
pip install packaging==22.0
pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu118 # CUDA 11.8 for example
pip install -r requirements.txt
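
After installation, a quick sanity check (a suggestion, not part of the repo) confirms that the pinned versions and the CUDA build are in place:

# Quick environment sanity check (suggested; not part of the repo).
import torch
import vllm

print(torch.__version__)          # expect 2.0.1+cu118, as installed above
print(torch.cuda.is_available())  # must be True for vLLM inference
print(vllm.__version__)           # expect 0.1.4, as pinned in requirements.txt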

🪁 Inference

We provide a script for inference; simply configure MODEL_NAME_OR_PATH and DATA in src/scripts/infer.sh and run the following command:

bash scripts/infer.sh

We also open-source the model outputs from our best models (ToRA-Code-34B and ToRA-70B) in the src/outputs/ folder.
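
These outputs are JSON-lines files, one record per problem. A snippet like the following can be used to inspect them; rather than assuming field names, print the keys of the first record:

# Inspect a released output file (field names vary across prompt types,
# so print the keys of the first record instead of assuming them).
import json

path = "src/outputs/llm-agents/tora-code-34b-v1.0/math/test_tora_-1_seed0_t0.0_s0_e5000.jsonl"
with open(path) as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} records")
print(records[0].keys())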

⚖️ Evaluation

The src/eval/grader.py file contains the grading logic that assesses the accuracy of a predicted answer by comparing it to the ground truth. This logic builds on the grading system from Hendrycks' MATH, and we have manually verified it on the MATH dataset to minimize both false positives and false negatives.
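
The core idea of such a grader is symbolic rather than string equality: two answers count as matching if their difference simplifies to zero. Below is a minimal sketch of that idea using SymPy; the actual grader.py is considerably more thorough.

# Minimal sketch of symbolic answer comparison (not the actual grader.py,
# which normalizes many more answer formats).
from sympy import simplify, sympify

def math_equal(pred: str, gt: str) -> bool:
    """Return True if pred and gt are symbolically equivalent."""
    if pred.strip() == gt.strip():      # fast path: exact string match
        return True
    try:
        return simplify(sympify(pred) - sympify(gt)) == 0
    except Exception:                   # unparsable answers never match
        return False

print(math_equal("1/2", "0.5"))           # True
print(math_equal("x*(x+1)", "x**2 + x"))  # True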

To evaluate the predicted answer, run the following command:

python -m eval.evaluate \
    --data_name "math" \
    --prompt_type "tora" \
    --file_path "outputs/llm-agents/tora-code-34b-v1.0/math/test_tora_-1_seed0_t0.0_s0_e5000.jsonl" \
    --execute

You should then see output like:

Num samples: 5000
Num scores: 5000
Timeout samples: 0
Empty samples: 2
Mean score: [51.0]
Type scores: {'Algebra': 67.3, 'Counting & Probability': 42.2, 'Geometry': 26.1, 'Intermediate Algebra': 40.0, 'Number Theory': 59.3, 'Prealgebra': 63.8, 'Precalculus': 34.2}

⚡️ Training

We're currently undergoing an internal review to open-source ToRA-Corpus-16k; stay tuned! We also open-source our complete training scripts for the community, so you can construct your own dataset for training. We provide some example data in data/tora/.
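
For a rough sense of what an interleaved training record might look like (the field names and delimiters below are assumptions; data/tora/ is the authoritative reference):

# Hypothetical ToRA-format training record (field names are assumptions;
# consult data/tora/ for the authoritative format).
import json

example = {
    "question": "What is $2^{10}$?",
    "output": (
        "Let's compute this with Python.\n"
        "```python\nprint(2**10)\n```\n"
        "```output\n1024\n```\n"
        "So $2^{10} = \\boxed{1024}$."
    ),
}

with open("data/tora/my_example.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")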

To train a model, run the following command:

bash scripts/train.sh codellama 7b

☕️ Citation

If you find this repository helpful, please consider citing our paper:

@inproceedings{gou2024tora,
    title={To{RA}: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving},
    author={Zhibin Gou and Zhihong Shao and Yeyun Gong and Yelong Shen and Yujiu Yang and Minlie Huang and Nan Duan and Weizhu Chen},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=Ep0TtjVoap}
}

🍀 Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

🌟 Star History

Star History Chart

tora's People

Contributors

microsoft-github-operations[bot], microsoftopensource, zhihongshao, zubingou


tora's Issues

Issues installing flash_attention==2.0.1, stuck & also freezes screen on local computer

Hi, thanks for the great work.

I am very keen to explore this work by installing the packages in requirements.txt, but I ran into several problems.

pip install 'flash-attn==2.0.1' and also
MAX_JOBS=4 pip install 'flash-attn==2.0.1' --no-build-isolation

both fail; the build keeps getting stuck at "Building wheels for collected packages: flash-attn".

However, MAX_JOBS=4 pip install 'flash-attn>=2.0.1' --no-build-isolation works but installs the latest version. May I check whether flash-attn 2.3.6 works for ToRA?

Thanks

Request for ToRA Self-Consistency Results with k=10 and k=20

Due to resource constraints, I am particularly interested in comparing the performance of ToRA-Code with self-consistency at k=10 and k=20. Could you kindly provide the results for these values of k for the 7B, 13B, and 34B model sizes?
Thank you very much for your assistance.

About the Imitation Learning in the paper

Hello, thank you for your contributions! ToRA is really a fantastic work and I've learned so much from it!

Well, I have a question about the term "imitation learning" in your paper. Given a problem as the source sequence and the final $\tau$ as the target sequence, is "imitation learning" any different from standard seq2seq training for decoder-based LLMs?

I guess the difference lies purely in the output $o_i$, which is not predicted or generated by the LLM, as the similar work MathCoder mentions in its Section 2.2.

However, I did not see such a description in your paper, and I did not find you extracting and masking the output part in your finetuning script. Could you please explain?
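
For concreteness, the masking I have in mind would look roughly like the sketch below (my own guess, not ToRA's training code; it assumes a HuggingFace fast tokenizer so offset mappings are available):

# Sketch of masking tool outputs in the loss (my guess, not ToRA's code):
# tokens inside ```output ... ``` blocks get label -100 so cross-entropy
# ignores them, since they come from the interpreter, not the model.
IGNORE_INDEX = -100

def labels_with_masked_tool_outputs(text: str, tokenizer):
    enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
    labels = list(enc["input_ids"])
    pos = 0
    while (start := text.find("```output", pos)) != -1:
        end = text.find("```", start + len("```output"))
        end = len(text) if end == -1 else end + 3
        for i, (s, e) in enumerate(enc["offset_mapping"]):
            if s >= start and e <= end:
                labels[i] = IGNORE_INDEX  # interpreter tokens: excluded from loss
        pos = end
    return enc["input_ids"], labels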

Format of the ToRA-Corpus 16k dataset

What is the format of the ToRA-Corpus 16k dataset?


Will the tora dataset be public?

Hello, thanks for your excellent work. I want to train ToRA myself with the code datasets. Will the dataset be made public? If so, when will it be available? Thank you again.

Is ToRA-69k, the SFT dataset, open-sourced?

Hello. I've read your Rho-1 paper and noticed the reference to ToRA-69k, but I cannot find a download link. Is the dataset open-source, or have I missed it? Thanks in advance.

How to create a ToRA-format dataset?

          Currently, the ToRA-Corpus 16k dataset is under review by the company and not immediately available. However, our code is public, enabling you to create your own ToRA-format dataset and train your own ToRA models. A 10k dataset should yield decent performance.

Originally posted by @ZubinGou in #7 (comment)
Hello, thank you for your nice code. Could you explain how to create a ToRA-format dataset using the code in this repo?
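
Based on the paper's description of sampling and filtering trajectories, I imagine a rejection-sampling loop roughly like the sketch below (my own guess; sample_trajectory and is_correct are hypothetical helpers, not functions from this repo). Is this roughly right?

# Rough sketch of building a ToRA-format dataset by rejection sampling
# (my own guess; sample_trajectory and is_correct are hypothetical helpers).
def build_tora_dataset(problems, sample_trajectory, is_correct, k=10):
    records = []
    for prob in problems:
        # sample k interleaved rationale/program trajectories per problem
        trajectories = [sample_trajectory(prob["question"]) for _ in range(k)]
        # keep only trajectories whose final answer matches the ground truth
        valid = [t for t in trajectories if is_correct(t, prob["answer"])]
        records += [{"question": prob["question"], "output": t} for t in valid]
    return records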

Training issues for the model weights

Thanks for your excellent work! I noticed that your training scripts use the "CodeLlama-${MODEL_SIZE}-Python-hf" weights. I constructed a small dataset and used your "llm-agents/tora-code-7b-v1.0" weights to continue finetuning on my data, but when I run inference with the finetuned weights, something goes wrong and the generated output is "". Should the training script be modified if I want to start from the tora-code-7b-v1.0 weights? Could you help me? Thanks a lot!

Clarification and Issues Reproducing ToRA-Code-34b Results

Issue Description

Hello, I am currently working on reproducing the results of ToRA, specifically with the ToRA-Code-34b model, but I am facing some challenges with accuracy and memory requirements. I would greatly appreciate some clarifications and assistance.

Reproduction Steps and Challenges

  • Data Distillation: I distilled approximately 16,000 samples from ToRA-Code-34b by sampling 10 trajectories per question and removing incorrect trajectories.
  • Accuracy Issue: Despite following the above approach, the accuracy of my results does not match the reported results in your documentation. Could there be any specific steps or settings I might be missing?

Questions Regarding Fine-Tuning and Memory Requirements

  1. DeepSpeed Configuration: Can you confirm if the DeepSpeed settings used for fine-tuning ToRA-Code-34b are exactly as specified in this configuration?
  2. Memory Requirements: How much memory would finetuning a 34B model require? I estimate around 1TB, but I only have about 250GB, so I used an 8-bit optimizer for finetuning. May I ask what machine configuration was used to train ToRA?

Clarification on Base Model

  • Could you please confirm whether ToRA-Code is fine-tuned on top of CodeLlama-Python or Code-Instruct?

Questions About Multi-Step Scenarios

  • Similar to this comment (ICLR 2024 review), we also observed a lack of multi-step or looping scenarios in the actual data.

Thank you for your time and assistance in addressing these questions. Your insights will be invaluable in helping me accurately reproduce and understand the ToRA-Code-34b model.

Failing to reproduce the CodeLlama baseline with this codebase

Hello, thanks for sharing the great work!

We tried to reproduce the CodeLlama-13B PAL results on the GSM-Hard dataset with the following script:

set -ex

MODEL_NAME_OR_PATH="LOCAL CODELLAMA-13B MODEL WEIGHTS"

# DATA_LIST = ['math', 'gsm8k', 'gsm-hard', 'svamp', 'tabmwp', 'asdiv', 'mawps']

DATA_NAME="gsm-hard"

SPLIT="test"
PROMPT_TYPE="pal"
NUM_TEST_SAMPLE=-1

CUDA_VISIBLE_DEVICES=0 TOKENIZERS_PARALLELISM=false \
python -um infer.inference \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--data_name ${DATA_NAME} \
--split ${SPLIT} \
--prompt_type ${PROMPT_TYPE} \
--num_test_sample ${NUM_TEST_SAMPLE} \
--seed 0 \
--temperature 0 \
--n_sampling 1 \
--top_p 1 \
--start 0 \
--end -1

The output results are as follows:

Num samples: 1319
Num scores: 1319
Timeout samples: 10
Empty samples: 293
Mean score: [0.1]

Time use: 7540.60s
Time use: 125:40

This gives a score of 0.1, which is far below the reported result. Would you please share the baseline script for CodeLlama with the PAL strategy? Thanks a lot.

Training Data Format

Can you show the training data format used to train LLaMA? If possible, please share an example training record so that we can build our own training dataset.

Reproducing ToRA Quoted Results

Hi,

Today I executed the following steps:

    1. Modified infer.sh to point to ToRA code, as shown:
       MODEL_NAME_OR_PATH="llm-agents/tora-code-7b-v1.0"
    2. Ran ./scripts/infer.sh
    3. The final eval output was as shown below:
Num samples: 5000
Num scores: 5000
Timeout samples: 1
Empty samples: 8
Mean score: [29.1]
Type scores: {'Algebra': 41.2, 'Counting & Probability': 22.8, 'Geometry': 14.6, 'Intermediate Algebra': 22.1, 'Number Theory': 30.0, 'Prealgebra': 36.7, 'Precalculus': 19.8}

This does not seem to agree with the ~45% quoted in the paper and on the Papers with Code page. Why is there such a large discrepancy, and how can I reproduce the quoted numbers?

Edit: here are the relevant package versions

>>> torch.__version__
'2.0.1+cu117'
>>> transformers.__version__
'4.35.0'
>>> vllm.__version__
'0.2.1.post1'

EDIT:

I think it was just the vllm version mismatch that caused the difference; I can replicate the benchmark now after running pip install -r requirements.txt to sync:

Num samples: 5000
Num scores: 5000
Timeout samples: 2
Empty samples: 1
Mean score: [44.6]
Type scores: {'Algebra': 61.2, 'Counting & Probability': 31.0, 'Geometry': 22.3, 'Intermediate Algebra': 36.2, 'Number Theory': 51.1, 'Prealgebra': 55.5, 'Precalculus': 29.7}
