
tora's Introduction

ToRA: A Tool-Integrated Reasoning Agent



[🌐 Website][📜 Paper][🤗 HF Models][🐱 GitHub]
[🐦 Twitter][💬 Reddit][🍀 Unofficial Blog]

Repo for "ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving" [ICLR'2024]


Figure 1: Comparing ToRA with baselines on LLaMA-2 base models from 7B to 70B.

🔥 News

  • [2023/10/08] 🔥🔥🔥 All ToRA models released at 🤗 HuggingFace!
  • [2023/09/29] ToRA paper, repo, and website released.

💡 Introduction

ToRA is a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical reasoning problems by interacting with tools, e.g., computation libraries and symbolic solvers. The ToRA series seamlessly integrates natural language reasoning with the use of external tools, combining the analytical strengths of language with the computational efficiency of tools.

| Model | Size | GSM8k | MATH | AVG@10 math tasks |
|---|---|---|---|---|
| GPT-4 | - | 92.0 | 42.5 | 78.3 |
| GPT-4 (PAL) | - | 94.2 | 51.8 | 86.4 |
| ToRA-7B | 7B | 68.8 | 40.1 | 62.4 |
| ToRA-Code-7B | 7B | 72.6 | 44.6 | 66.5 |
| ToRA-Code-7B + self-consistency (k=50) | 7B | 76.8 | 52.5 | - |
| ToRA-13B | 13B | 72.7 | 43.0 | 65.9 |
| ToRA-Code-13B | 13B | 75.8 | 48.1 | 71.3 |
| ToRA-Code-13B + self-consistency (k=50) | 13B | 80.4 | 55.1 | - |
| ToRA-Code-34B* | 34B | 80.7 | 51.0 | 74.8 |
| ToRA-Code-34B + self-consistency (k=50) | 34B | 85.1 | 60.0 | - |
| ToRA-70B | 70B | 84.3 | 49.7 | 76.9 |
| ToRA-70B + self-consistency (k=50) | 70B | 88.3 | 56.9 | - |
  • *ToRA-Code-34B is currently the first and only open-source model to achieve over 50% accuracy (pass@1) on the MATH dataset. It significantly outperforms GPT-4's CoT result (51.0 vs. 42.5) and is competitive with GPT-4 solving problems with programs (PAL). By open-sourcing our code and models, we hope more breakthroughs will come!

  • 10 math tasks include GSM8k, MATH, GSM-Hard, SVAMP, TabMWP, ASDiv, SingleEQ, SingleOP, AddSub, and MultiArith.
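
The "+ self-consistency (k=50)" rows are obtained by majority voting the final answer over k sampled solutions per problem; the snippet below is a minimal illustration of the idea, not the evaluation code in this repo:

from collections import Counter

# Minimal illustration of self-consistency: majority-vote the final answers
# of k sampled solutions (illustrative only; not this repo's evaluation code).
def majority_vote(answers):
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(["1024", "1024", "512", "1024"]))  # -> "1024"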

Tool-Integrated Reasoning


Figure 2: A basic example of single-round tool interaction, which interleaves rationales with program-based tool use.
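
To make the interaction pattern concrete, the sketch below walks through one round: the model writes a rationale followed by a program, the program is executed, and the output is appended so the model can finish its reasoning. The generate function and the exact delimiters are illustrative assumptions based on Figure 2, not this repo's inference code.

import io
import contextlib

def run_program(code: str) -> str:
    """Execute a generated Python snippet and capture its stdout.
    NOTE: no sandboxing here; a real pipeline should isolate execution."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def tool_integrated_round(problem: str, generate) -> str:
    # 1) The model writes a rationale followed by a ```python ... ``` program.
    draft = generate(problem, stop="```output")
    code = draft.split("```python")[-1].split("```")[0]
    # 2) The program is run by the external tool (here, a Python interpreter).
    result = run_program(code)
    # 3) The tool output is appended so the model can continue its reasoning.
    return generate(problem + draft + "```output\n" + result + "\n```\n")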

ToRA Training Pipeline


Figure 3: ToRA training consists of ① imitation learning and ② output space shaping.
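
As a rough sketch of step ②, output space shaping samples extra trajectories from the fine-tuned model, keeps the valid ones, and lets a teacher model correct invalid ones before another round of training. The helpers below (sample, is_valid, teacher_correct) are hypothetical; see the paper for the exact procedure.

# Rough sketch of output space shaping (step 2). The helpers `sample`,
# `is_valid`, and `teacher_correct` are hypothetical, not repo functions.
def shape_output_space(model, teacher_correct, is_valid, problems, k=4):
    extra_data = []
    for prob in problems:
        for traj in model.sample(prob.question, n=k):  # diverse sampled trajectories
            if is_valid(traj, prob.answer):            # keep correct trajectories
                extra_data.append((prob, traj))
            else:                                      # let a teacher fix wrong ones
                fixed = teacher_correct(prob, traj)
                if fixed is not None and is_valid(fixed, prob.answer):
                    extra_data.append((prob, fixed))
    return extra_data  # merged with the corpus for another training round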

🚀 Quick Start

⚙️ Setup

We recommend using Conda to manage your environment. We use vLLM (0.1.4) to accelerate inference. Run the following commands to set up your environment:

git clone https://github.com/microsoft/ToRA.git && cd ToRA/src
conda create -n tora python=3.10
conda activate tora
pip install packaging==22.0
pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu118 # CUDA 11.8 for example
pip install -r requirements.txt
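
After installation, a quick sanity check (a suggestion, not part of the repo) confirms that the pinned versions and the CUDA build are in place:

# Quick environment sanity check (suggested; not part of the repo).
import torch
import vllm

print(torch.__version__)          # expect 2.0.1+cu118, as installed above
print(torch.cuda.is_available())  # must be True for vLLM inference
print(vllm.__version__)           # expect 0.1.4, as pinned in requirements.txt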

🪁 Inference

We provide a script for inference; simply configure MODEL_NAME_OR_PATH and DATA in src/scripts/infer.sh and run the following command:

bash scripts/infer.sh

We also open-source the model outputs from our best models (ToRA-Code-34B and ToRA-70B) in the src/outputs/ folder.
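
These outputs are JSON-lines files, one record per problem. A snippet like the following can be used to inspect them; rather than assuming field names, print the keys of the first record:

# Inspect a released output file (field names vary across prompt types,
# so print the keys of the first record instead of assuming them).
import json

path = "src/outputs/llm-agents/tora-code-34b-v1.0/math/test_tora_-1_seed0_t0.0_s0_e5000.jsonl"
with open(path) as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} records")
print(records[0].keys())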

⚖️ Evaluation

The src/eval/grader.py file contains the grading logic that assesses the accuracy of a predicted answer by comparing it to the ground truth. This logic builds on the grading system from Hendrycks' MATH, and we have manually verified it on the MATH dataset to minimize both false positives and false negatives.
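
The core idea of such a grader is symbolic rather than string equality: two answers count as matching if their difference simplifies to zero. Below is a minimal sketch of that idea using SymPy; the actual grader.py is considerably more thorough.

# Minimal sketch of symbolic answer comparison (not the actual grader.py,
# which normalizes many more answer formats).
from sympy import simplify, sympify

def math_equal(pred: str, gt: str) -> bool:
    """Return True if pred and gt are symbolically equivalent."""
    if pred.strip() == gt.strip():      # fast path: exact string match
        return True
    try:
        return simplify(sympify(pred) - sympify(gt)) == 0
    except Exception:                   # unparsable answers never match
        return False

print(math_equal("1/2", "0.5"))           # True
print(math_equal("x*(x+1)", "x**2 + x"))  # True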

To evaluate the predicted answer, run the following command:

python -m eval.evaluate \
    --data_name "math" \
    --prompt_type "tora" \
    --file_path "outputs/llm-agents/tora-code-34b-v1.0/math/test_tora_-1_seed0_t0.0_s0_e5000.jsonl" \
    --execute

You should then see output like:

Num samples: 5000
Num scores: 5000
Timeout samples: 0
Empty samples: 2
Mean score: [51.0]
Type scores: {'Algebra': 67.3, 'Counting & Probability': 42.2, 'Geometry': 26.1, 'Intermediate Algebra': 40.0, 'Number Theory': 59.3, 'Prealgebra': 63.8, 'Precalculus': 34.2}

⚡️ Training

We're currently undergoing an internal review to open-source ToRA-Corpus-16k; stay tuned! We also open-source our complete training scripts for the community, so you can construct your own dataset for training. We provide some example data in data/tora/.
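
For a rough sense of what an interleaved training record might look like (the field names and delimiters below are assumptions; data/tora/ is the authoritative reference):

# Hypothetical ToRA-format training record (field names are assumptions;
# consult data/tora/ for the authoritative format).
import json

example = {
    "question": "What is $2^{10}$?",
    "output": (
        "Let's compute this with Python.\n"
        "```python\nprint(2**10)\n```\n"
        "```output\n1024\n```\n"
        "So $2^{10} = \\boxed{1024}$."
    ),
}

with open("data/tora/my_example.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")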

To train a model, run the following command:

bash scripts/train.sh codellama 7b

☕️ Citation

If you find this repository helpful, please consider citing our paper:

@inproceedings{gou2024tora,
    title={To{RA}: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving},
    author={Zhibin Gou and Zhihong Shao and Yeyun Gong and Yelong Shen and Yujiu Yang and Minlie Huang and Nan Duan and Weizhu Chen},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=Ep0TtjVoap}
}

🍀 Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

🌟 Star History

Star History Chart

tora's People

Contributors

microsoft-github-operations[bot], microsoftopensource, zhihongshao, zubingou


tora's Issues

Issues installing flash_attention==2.0.1, stuck & also freezes screen on local computer

Hi, thanks for the great work.

I am very keen to explore this work by installing the packages in requirements.txt, but I ran into several problems.

pip install 'flash-attn==2.0.1' and also
MAX_JOBS=4 pip install 'flash-attn==2.0.1' --no-build-isolation

both fail; the build keeps getting stuck at "Building wheels for collected packages: flash-attn".

However, MAX_JOBS=4 pip install 'flash-attn>=2.0.1' --no-build-isolation works but installs the latest version. May I check whether flash-attn 2.3.6 works for ToRA?

Thanks

Request for ToRA Self-Consistency Results with k=10 and k=20

Due to resource constraints, I am particularly interested in comparing the performance of ToRA-Code with self-consistency at k=10 and k=20. Could you kindly provide the results for these values of k for the 7B, 13B, and 34B model sizes?
Thank you very much for your assistance.

About the Imitation Learning in the paper

Hello, thank you for your contributions! ToRA is really a fantastic work and I've learned so much from it!

Well, I have a question about the term "imitation learning" in your paper. Given a problem as the source sequence and the final $\tau$ as the target sequence, is "imitation learning" any different from standard seq2seq training for decoder-based LLMs?

I guess the difference lies purely in the output $o_i$, which is not predicted or generated by the LLM, as the similar work MathCoder mentions in its Section 2.2.

However, I did not see such a description in your paper, and I did not find you extracting and masking the output part in your finetuning script. Could you please explain?
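
For concreteness, the masking I have in mind would look roughly like the sketch below (my own guess, not ToRA's training code; it assumes a HuggingFace fast tokenizer so offset mappings are available):

# Sketch of masking tool outputs in the loss (my guess, not ToRA's code):
# tokens inside ```output ... ``` blocks get label -100 so cross-entropy
# ignores them, since they come from the interpreter, not the model.
IGNORE_INDEX = -100

def labels_with_masked_tool_outputs(text: str, tokenizer):
    enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
    labels = list(enc["input_ids"])
    pos = 0
    while (start := text.find("```output", pos)) != -1:
        end = text.find("```", start + len("```output"))
        end = len(text) if end == -1 else end + 3
        for i, (s, e) in enumerate(enc["offset_mapping"]):
            if s >= start and e <= end:
                labels[i] = IGNORE_INDEX  # interpreter tokens: excluded from loss
        pos = end
    return enc["input_ids"], labels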

Format of the ToRA-Corpus 16k dataset

What is the format of the ToRA-Corpus 16k dataset?


Will the tora dataset be public?

Hello, thanks for your excellent work. I want to train ToRA myself with the code datasets. Will the dataset be made public? If so, when will it be available? Thank you again.

Is ToRA-69k, the SFT dataset, open-sourced?

Hello. I've read your Rho-1 paper and noticed the reference to ToRA-69k, but I cannot find a download link. Is the dataset open-source, or have I missed it? Thanks in advance.

How to create a ToRA-format dataset?

          Currently, the ToRA-Corpus 16k dataset is under review by the company and not immediately available. However, our code is public, enabling you to create your own ToRA-format dataset and train your own ToRA models. A 10k dataset should yield decent performance.

Originally posted by @ZubinGou in #7 (comment)
Hello, thank you for your nice code. Could you explain how to create a ToRA-format dataset using the code in this repo?
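
Based on the paper's description of sampling and filtering trajectories, I imagine a rejection-sampling loop roughly like the sketch below (my own guess; sample_trajectory and is_correct are hypothetical helpers, not functions from this repo). Is this roughly right?

# Rough sketch of building a ToRA-format dataset by rejection sampling
# (my own guess; sample_trajectory and is_correct are hypothetical helpers).
def build_tora_dataset(problems, sample_trajectory, is_correct, k=10):
    records = []
    for prob in problems:
        # sample k interleaved rationale/program trajectories per problem
        trajectories = [sample_trajectory(prob["question"]) for _ in range(k)]
        # keep only trajectories whose final answer matches the ground truth
        valid = [t for t in trajectories if is_correct(t, prob["answer"])]
        records += [{"question": prob["question"], "output": t} for t in valid]
    return records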

Training issues for the model weights

Thanks for your excellent work! I noticed that your training scripts use the "CodeLlama-${MODEL_SIZE}-Python-hf" weights. I constructed a small dataset and used your "llm-agents/tora-code-7b-v1.0" weights to continue finetuning on my data, but when I run inference with the finetuned weights, something goes wrong and the generated output is "". Should the training script be modified if I want to start from the tora-code-7b-v1.0 weights? Could you help me? Thanks a lot!

Clarification and Issues Reproducing ToRA-Code-34b Results

Issue Description

Hello, I am currently working on reproducing the results of ToRA, specifically with the ToRA-Code-34b model, but I am facing some challenges with accuracy and memory requirements. I would greatly appreciate some clarifications and assistance.

Reproduction Steps and Challenges

  • Data Distillation: I distilled approximately 16,000 samples from ToRA-Code-34b by sampling 10 trajectories per question and removing incorrect trajectories.
  • Accuracy Issue: Despite following the above approach, the accuracy of my results does not match the reported results in your documentation. Could there be any specific steps or settings I might be missing?

Questions Regarding Fine-Tuning and Memory Requirements

  1. DeepSpeed Configuration: Can you confirm if the DeepSpeed settings used for fine-tuning ToRA-Code-34b are exactly as specified in this configuration?
  2. Memory Requirements: How much memory would finetuning a 34B model require? I estimate around 1TB, but I only have about 250GB, so I used an 8-bit optimizer for finetuning. May I ask what machine configuration was used to train ToRA?

Clarification on Base Model

  • Could you please confirm whether ToRA-Code is fine-tuned on top of CodeLlama-Python or Code-Instruct?

Questions About Multi-Step Scenarios

  • Similar to this comment (ICLR 2024 review), we also observed a lack of multi-step or looping scenarios in the actual data.

Thank you for your time and assistance in addressing these questions. Your insights will be invaluable in helping me accurately reproduce and understand the ToRA-Code-34b model.

Failing to reproduce the CodeLlama baseline with this codebase

Hello, thanks for sharing the great work!

We tried to reproduce the CodeLlama-13B PAL results on the GSM-Hard dataset with the following script:

set -ex

MODEL_NAME_OR_PATH="LOCAL CODELLAMA-13B MODEL WEIGHTS"

# DATA_LIST = ['math', 'gsm8k', 'gsm-hard', 'svamp', 'tabmwp', 'asdiv', 'mawps']

DATA_NAME="gsm-hard"

SPLIT="test"
PROMPT_TYPE="pal"
NUM_TEST_SAMPLE=-1

CUDA_VISIBLE_DEVICES=0 TOKENIZERS_PARALLELISM=false \
python -um infer.inference \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--data_name ${DATA_NAME} \
--split ${SPLIT} \
--prompt_type ${PROMPT_TYPE} \
--num_test_sample ${NUM_TEST_SAMPLE} \
--seed 0 \
--temperature 0 \
--n_sampling 1 \
--top_p 1 \
--start 0 \
--end -1

The output results are as follows:

Num samples: 1319
Num scores: 1319
Timeout samples: 10
Empty samples: 293
Mean score: [0.1]

Time use: 7540.60s
Time use: 125:40

This gives a score of 0.1, which is far below the reported result. Would you please share the baseline script for CodeLlama with the PAL strategy? Thanks a lot.

Training Data Format

Can you show the training data format used to train LLaMA? If possible, please share an example training record so that we can build our own training dataset.

Reproducing ToRA Quoted Results

Hi,

Today I executed the following steps:

    1. Modified infer.sh to point to ToRA code, as shown:
       MODEL_NAME_OR_PATH="llm-agents/tora-code-7b-v1.0"
    2. Ran ./scripts/infer.sh
    3. The final eval output was as shown below:
Num samples: 5000
Num scores: 5000
Timeout samples: 1
Empty samples: 8
Mean score: [29.1]
Type scores: {'Algebra': 41.2, 'Counting & Probability': 22.8, 'Geometry': 14.6, 'Intermediate Algebra': 22.1, 'Number Theory': 30.0, 'Prealgebra': 36.7, 'Precalculus': 19.8}

This does not seem to agree with the ~45% quoted in the paper and on the Papers with Code page. Why is there such a large discrepancy, and how can I reproduce the quoted numbers?

Edit: here are the relevant package versions

>>> torch.__version__
'2.0.1+cu117'
>>> transformers.__version__
'4.35.0'
>>> vllm.__version__
'0.2.1.post1'

EDIT:

I think it was just the vllm version mismatch that caused the difference; I can replicate the benchmark now after running pip install -r requirements.txt to sync:

Num samples: 5000
Num scores: 5000
Timeout samples: 2
Empty samples: 1
Mean score: [44.6]
Type scores: {'Algebra': 61.2, 'Counting & Probability': 31.0, 'Geometry': 22.3, 'Intermediate Algebra': 36.2, 'Number Theory': 51.1, 'Prealgebra': 55.5, 'Precalculus': 29.7}
