
DevBench: Towards LLM-based Automated Software Development

👋 Overview | 📖 Benchmarking | ⚙️ Setup | 🚀 Usage | 🔎 Citation | 📄 License

📬 Contact: [email protected], [email protected]

📝 Check out our paper HERE!

👋 Overview

  • DevBench is a comprehensive benchmark designed to evaluate LLMs across various stages of the software development lifecycle, including software design, environment setup, implementation, acceptance testing, and unit testing. By integrating these interconnected steps under a single framework, DevBench offers a holistic perspective on the potential of LLMs for automated software development.

  • The DevBench dataset comprises 22 curated repositories across 4 programming languages (Python, C/C++, Java, JavaScript), covering a wide range of domains such as machine learning, databases, web services, and command-line utilities.

  • DevBench includes a comprehensive and automatic evaluation suite for all tasks involved. We provide extensive acceptance and unit test cases for the implementation task 🤗. Additionally, we utilize LLM-as-a-Judge for evaluating the software design task 👩🏽‍⚖️. Further details on our task specifications can be found here.

  • We have developed a baseline agent system based on the popular multi-agent software development system, ChatDev. Special thanks to our collaborators at ChatDev!

📖 Benchmarking Code LLMs

Evaluation results of the coding tasks on DevBench.

| Model | Environment Setup: Pass@ Example Usage§ | Implementation: Pass@ Accept. Test¶ | Implementation: Pass@ Unit Test¶ | Acceptance Testing: Oracle Test§ | Unit Testing: Oracle Test§ | Unit Testing: Coverage$ |
|---|---|---|---|---|---|---|
| GPT-3.5-Turbo | 33.3 | 4.2 | 4.3 | 11.7 | 28.7 | 24.6 (61.4) |
| GPT-4-Turbo-1106 | 41.7 | 6.9 | 6.8 | 25.9 | 33.6 | 36.7 (66.7) |
| GPT-4-Turbo-0125 | 41.7 | 7.1 | 8.0 | 29.2 | 36.5 | 33.2 (66.3) |
| CodeLlama-7B-Instruct | 8.3 | 0.0 | 0.0 | 0.0 | 3.0 | 3.6 (71.0) |
| CodeLlama-13B-Instruct | 25.0 | 0.6 | 0.0 | 0.0 | 5.1 | 8.6 (57.6) |
| CodeLlama-34B-Instruct | 16.7 | 0.6 | 0.5 | 4.5 | 21.1 | 25.4 (72.6) |
| DeepSeek-Coder-1.3B-Instruct | 8.3 | 0.0 | 0.1 | 0.0 | 5.6 | 2.7 (27.0) |
| DeepSeek-Coder-6.7B-Instruct | 25.0 | 2.9 | 3.9 | 20.5♡ | 23.5 | 28.2 (70.6) |
| DeepSeek-Coder-33B-Instruct | 16.7 | 4.4 | 5.5 | 13.6 | 32.8 | 35.7 (79.4) |
Notes:

  • Environment Setup: test cases for this task are scarce compared to the other tasks, so its results are more influenced by randomness.
  • §: results are averaged across all repositories and weighted uniformly.
  • ¶: results are averaged across all repositories and weighted by the number of code lines.
  • $: the figure on the left is averaged across all repositories and weighted uniformly, giving the overall score; the figure in parentheses is averaged, weighted uniformly, across only the valid repositories where the model generated executable testing code.
  • ♡: the model generated meaningless but executable testing code.

Evaluation results of the software design on DevBench.

The code for the software design evaluation can be found here 👩🏽‍⚖️.

| Model | w/ Tie: General Principles† | w/ Tie: Faithfulness‡ | w/o Tie: General Principles | w/o Tie: Faithfulness |
|---|---|---|---|---|
| GPT-4-Turbo-0125 | 97.9 | 97.9 | 100.0 | 100.0 |
| GPT-4-Turbo-1106 | 91.7 | 85.4 | 100.0 | 100.0 |
| CodeLlama-7B-Instruct | 4.2 | 8.3 | 4.2 | 4.5 |
| CodeLlama-13B-Instruct | 18.8 | 14.6 | 10.5 | 5.3 |
| CodeLlama-34B-Instruct | 39.6 | 33.3 | 33.3 | 21.4 |
| DeepSeek-Coder-1.3B-Instruct | 16.7 | 16.7 | 5.5 | 5.6 |
| DeepSeek-Coder-6.7B-Instruct | 35.4 | 35.4 | 31.6 | 29.4 |
| DeepSeek-Coder-33B-Instruct | 52.1 | 50.0 | 53.8 | 50.0 |
| Agree w/ Human Majority | 60.4 | 51.6 | 79.2 | 83.2 |
Win rate of pairwise comparisons against GPT-3.5-Turbo on Software Design, on a subset of DevBench; results are averaged uniformly across repositories and sub-tasks. †: the general principles metric. ‡: the faithfulness metric. w/ Tie: inconsistent judgments are counted as a tie. We also report agreement with the human majority.

🐳 Set Up with Docker

For a secure and isolated environment, we offer Docker support for DevBench. Please refer to our detailed Installation Guide.
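
As a rough illustration (the image tag and commands here are assumptions, not the guide's exact steps), building the image and entering the container might look like:

# Build the image from the repository root (tag name is illustrative)
docker build -t devbench .
# Start an interactive shell inside the container
docker run -it devbench bash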

🚀 Usage

1. Prepare the environment variables

Add your DevBench directory to your PYTHONPATH variable.

export PYTHONPATH="${PYTHONPATH}:${path_to_devbench}"

To run the benchmark_data/java/Actor_relationship_game repo, configure your TMDB API key.

export TMDB_API_KEY=${your_TMDB_key}

2. Prepare the chat models

OpenAI GPT models

Set your OpenAI API key as an environment variable.

export OPENAI_API_KEY="your_OpenAI_API_key"

Open source models

For deploying open source models, please refer to lmdeploy or vllm.
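
For instance, serving one of these models through vLLM's OpenAI-compatible API server could look like the sketch below (the model path and port are illustrative assumptions, and flags may differ across vLLM versions):

# Launch an OpenAI-compatible endpoint for a DeepSeek-Coder model (illustrative)
python -m vllm.entrypoints.openai.api_server --model deepseek-ai/deepseek-coder-6.7b-instruct --port 7001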

After the deployment, please configure the IP address in open_source_model.json.

For codellama and deepseek-coder models, which are integrated into our experiments, simply fill in the IP address in {"model_name": $model_ip_address}.

For example:

{
  "codellama-7b-instruct": "",
  "codellama-13b-instruct": "",
  "codellama-34b-instruct": "",
  "deepseek-coder-1.3b-instruct": "",
  "deepseek-coder-6.7b-instruct": "",
  "deepseek-coder-33b-instruct": "$model_ip_address"
}
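
The value should be the base URL of the deployed OpenAI-compatible endpoint, e.g. http://<host>:7001/v1 (host and port here are placeholders for your own deployment).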

For additional models, add a new field as shown below.

{
  "customized-model": {"$model_name": "$model_ip_address"}
}
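
When running the agent, such a model is then selected with --model customized-model together with --customized_model_name (see the parameters below).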

3. Run the agent system

Run script

cd agent_system/baseline
python run.py --config Implementation --input_path ../../benchmark_data/python/TextCNN/ --model gpt-4-turbo-new --model_source openai --review execution --evaluate

Parameters

  • config (str) - Specifies the task in the DevBench: SoftwareDesign | EnvironmentSetup | Implementation | AcceptanceTesting | UnitTesting.
  • input_path (str) - Specifies the repo path.
  • project_name (str) - Specifies the repo name. If empty, defaults to the last segment of input_path (i.e., input_path.split('/')[-1]).
  • model (str) - Specifies the name of the language model: gpt-3.5-turbo | gpt-4 | gpt-4-32k | gpt-4-turbo | claude-2 | claude-2.1 | codellama-7b | codellama-13b | codellama-34b | deepseek-coder-1.3b | deepseek-coder-6.7b | deepseek-coder-33b | customized-model.
  • customized_model_name (Optional, str) - Specifies the custom model name if the value of the model parameter is customized-model.
  • model_source (str) - Specifies the model source, an open source model or an OpenAI closed-source model: open_source | openai.
  • review (str) - Specifies the review mode: none | normal | execution.
    • none: a single forward pass of Coding.
    • normal: Coding and CodeReview in alternation, with CodeReview lacking program execution feedback.
    • execution: Coding and CodeReview in alternation, with CodeReview including program execution feedback.
  • read_src_code (bool) - Whether to read source code in the AcceptanceTesting and UnitTesting tasks.
  • evaluate (bool) - Whether to evaluate in the end. The evaluation for the software design can be found here.
  • temperature (float) - The sampling temperature.
  • top_p (float) - The nucleus sampling top_p parameter.
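
For example, the same Implementation task can be run with a self-hosted open source model (assuming a deepseek-coder-6.7b endpoint has been deployed and its address configured as described above):

python run.py --config Implementation --input_path ../../benchmark_data/python/TextCNN/ --model deepseek-coder-6.7b --model_source open_source --review execution --evaluate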

When using normal or execution review, the cyclenum parameter in CompanyConfig/{task_name}/ChatChainConfig.json specifies the number of review rounds. The default is 2.
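
For illustration, setting three review rounds would amount to changing that one field (a sketch showing only this field; the rest of the file is omitted here):

{
  "cyclenum": 3
}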

🔎 Citation

@article{li2024devbench,
  title={DevBench: A Comprehensive Benchmark for Software Development},
  author={Li, Bowen and Wu, Wenhan and Tang, Ziwei and Shi, Lin and Yang, John and Li, Jinyang and Yao, Shunyu and Qian, Chen and Hui, Binyuan and Zhang, Qicheng and others},
  journal={arXiv preprint arXiv:2403.08604},
  year={2024}
}

📄 License

  • Source Code Licensing: Our project's source code is licensed under the Apache 2.0 License. This license permits the use, modification, and distribution of the code, subject to certain conditions outlined in the Apache 2.0 License.
  • Data Licensing: The related data utilized in our project is licensed under CC BY 4.0, which allows anyone to copy, distribute, transmit, adapt and make commercial use of the dataset.


