Task comparison: Unlike benchmarks where coding agents modify code snippets from pre-existing codebases, ProjDevBench evaluates end-to-end repository construction from project-level requirements.
Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories.
Combining Online Judge (OJ) testing with LLM-assisted code review, the benchmark evaluates agents on (1) system architecture design, (2) functional correctness, and (3) iterative solution refinement. We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios, and evaluate six coding agents built on different LLM backends.
Overview of the ProjDevBench evaluation pipeline.
Performance on ProjDevBench across six coding agents and multiple LLM backends. Exec. represents the execution score from Online Judge, CR represents the code review score, and Final is the weighted combination (80% Exec. + 20% CR).
| Rank | Agent | Model | Easy Exec. | Easy CR | Hard Exec. | Hard CR | Overall Exec. | Overall CR | Final |
|---|---|---|---|---|---|---|---|---|---|
| 🥇 | Codex | GPT-5 | 79.24 | 82.11 | 69.22 | 82.90 | 76.73 | 82.31 | 77.85 |
| 🥈 | Cursor | Gemini-3-Pro-Preview | 72.87 | 88.67 | 71.47 | 80.03 | 72.52 | 86.51 | 75.32 |
| 🥉 | Augment | GPT-5 | 77.10 | 76.00 | 57.22 | 65.03 | 72.13 | 73.26 | 72.35 |
| 4 | Cursor | GPT-5 | 69.74 | 80.56 | 67.80 | 87.27 | 69.26 | 82.23 | 71.85 |
| 5 | Cursor | Sonnet-4.5 | 71.12 | 85.67 | 60.17 | 66.47 | 68.39 | 80.87 | 70.88 |
| 6 | Augment | Sonnet-4.5 | 69.14 | 92.56 | 56.81 | 67.43 | 66.06 | 86.28 | 70.10 |
| 7 | Claude Code | Sonnet-4.5 | 66.85 | 92.89 | 54.47 | 78.57 | 63.76 | 89.31 | 68.87 |
| 8 | Gemini CLI | Gemini-3-Pro-Preview | 74.57 | 80.33 | 35.53 | 94.20 | 64.81 | 83.80 | 68.61 |
| 9 | GitHub Copilot | Sonnet-4.5 | 71.10 | 87.89 | 36.63 | 80.23 | 62.48 | 85.97 | 67.18 |
| 10 | Codex | Sonnet-4.5 | 66.07 | 68.22 | 31.88 | 83.23 | 57.52 | 71.98 | 60.41 |
Analysis of submission outcomes across all agents reveals that only 27.38% of submissions were accepted, with the majority failing due to wrong answers (41.86%) or time limit violations (13.91%).
| Status Type | Count | Percentage |
|---|---|---|
| Accepted | 484 | 27.38% |
| Wrong Answer | 740 | 41.86% |
| Time Limit Exceeded | 246 | 13.91% |
| Runtime Error | 124 | 7.01% |
| Compile Error | 80 | 4.52% |
| Memory Leak | 62 | 3.51% |
| Memory Limit Exceeded | 24 | 1.36% |
| Others | 8 | 0.45% |
ProjDevBench contains 20 problems across 8 categories. Easy problems provide a partial codebase (project-completion), while Hard problems require from-scratch construction (project-creation).
Distribution of ProjDevBench tasks across 8 categories.
| ID | Problem Name | Category | Difficulty | Time Limit | Memory Limit | Avg Score |
|---|---|---|---|---|---|---|
| 001 | A+B Problem | Algorithm | Easy | 1s | 256 MiB | 54.37 |
| 002 | int2048 Big Integer | Algorithm | Easy | 10s | 190 MiB | 48.19 |
| 003 | ICPC Management System | Management | Hard | 2s | 512 MiB | 52.07 |
| 004 | Bookstore System | Management | Hard | 10s | 64 MiB | 36.29 |
| 005 | QOI Format Codec | Algorithm | Easy | 10s | 512 MiB | 58.87 |
| 006 | Minesweeper | Game | Easy | 30s | 256 MiB | 53.51 |
| 007 | BASIC Interpreter | Interpreter | Easy | 5s | 256 MiB | 47.67 |
| 008 | MOV Language | Assembly | Easy | - | - | 54.70 |
| 009 | STLite Vector | Data Structure | Easy | 100s | 768 MiB | 58.46 |
| 010 | STLite List | Data Structure | Easy | 25s | 768 MiB | 30.76 |
| 011 | STLite Priority Queue | Data Structure | Easy | 15s | 512 MiB | 57.25 |
| 012 | STLite Linked HashMap | Data Structure | Easy | 24s | 893 MiB | 43.36 |
| 013 | STLite Map | Data Structure | Easy | 30s | 893 MiB | 58.21 |
| 014 | Python Interpreter | Interpreter | Easy | 16s | 512 MiB | 46.23 |
| 015 | File Storage | Storage | Hard | 16s | 6 MiB | 42.71 |
| 016 | File Storage BPT | Storage | Hard | 5s | 64 MiB | 40.11 |
| 017 | Train Ticket System | Management | Hard | 40s | 47 MiB | 53.24 |
| 018 | Scheme Interpreter | Interpreter | Easy | 1.5s | 244 MiB | 32.94 |
| 019 | GPU Memory Optimization | Optimization | Easy | 1s | 244 MiB | 36.89 |
| 020 | Buddy Algorithm | Optimization | Easy | 10s | 244 MiB | 33.33 |
Difficulty Definition: Easy = Project-completion (partial codebase provided), Hard = Project-creation (from-scratch construction)
ProjDevBench adopts a dual evaluation protocol that distinguishes hard functional correctness, measured by Online Judge execution, from rule- and specification-level compliance, measured by LLM-assisted code review.
Final Score = 0.8 × Execution Score + 0.2 × Code Review Score
Prioritizes functional correctness while penalizing specification violations.
Agents frequently generate syntactically correct frameworks but omit critical business logic, and they fail to distinguish development contexts from submission contexts.
Systematic weaknesses in boundary-condition handling lead to Wrong Answer and Runtime Error failures, including null-pointer dereferences.
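A minimal illustration (ours, not taken from any benchmark task) of the kind of boundary guard that was often missing:

```cpp
#include <cstddef>

// Singly linked list node, as in list-style tasks (e.g. STLite List).
struct Node {
    int value;
    Node* next;
};

// Unguarded access: dereferences head even when the list is empty,
// the classic null-pointer path behind many Runtime Error verdicts.
int front_unchecked(Node* head) { return head->value; }

// Guarded access: the empty-list boundary is handled explicitly.
int front_checked(Node* head, int fallback) {
    return head != nullptr ? head->value : fallback;
}
```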
Agents favor familiar but suboptimal patterns, such as using O(log N) ordered maps where O(1) hash tables would suffice, leading to Time Limit Exceeded submissions.
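For instance, a key-to-record index can be tree-based or hash-based behind the same interface; a sketch with identifiers of our choosing:

```cpp
#include <map>
#include <string>
#include <unordered_map>

// Tree-based index: each find() walks O(log N) nodes,
// comparing full string keys along the way.
long lookup_tree(const std::map<std::string, long>& index,
                 const std::string& key) {
    auto it = index.find(key);
    return it != index.end() ? it->second : -1;
}

// Hash-based index: expected O(1) per find(). On query-heavy tasks,
// this gap can decide between Accepted and Time Limit Exceeded.
long lookup_hash(const std::unordered_map<std::string, long>& index,
                 const std::string& key) {
    auto it = index.find(key);
    return it != index.end() ? it->second : -1;
}
```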
Agents show significant limitations in exception safety and memory management, preferring manual new/delete over RAII patterns and thereby causing memory leaks.
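The leak pattern can be demonstrated in a few lines; the counter and function names are ours, for illustration only:

```cpp
#include <memory>
#include <stdexcept>

static int live_objects = 0;  // crude leak detector for this demo

struct Resource {
    Resource()  { ++live_objects; }
    ~Resource() { --live_objects; }
};

// Manual new/delete: an exception thrown between new and delete
// skips the delete, leaking the object.
void manual_style(bool fail) {
    Resource* r = new Resource();
    if (fail) throw std::runtime_error("mid-operation failure");
    delete r;
}

// RAII: the unique_ptr destroys the Resource on every exit path,
// including the exceptional one.
void raii_style(bool fail) {
    auto r = std::make_unique<Resource>();
    if (fail) throw std::runtime_error("mid-operation failure");
}
```

After a failing call, `manual_style` leaves one `Resource` alive (leaked), while `raii_style` leaves none behind.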
If you have any questions regarding ProjDevBench, feel free to reach out to us via email at lupengrui@sjtu.edu.cn, or directly submit a GitHub issue.
If you find ProjDevBench useful for your research, please consider citing our paper:
@misc{lu2026projdevbenchbenchmarkingaicoding,
      title={ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development},
      author={Pengrui Lu and Shiqi Zhang and Yunzhong Hou and Lyumanshan Ye and Chaoyi Huang and Zixi Chen and Ji Zeng and Hantao Jiang and Pengfei Liu and Yiwei Wang and Ming-Hsuan Yang},
      year={2026},
      eprint={2602.01655},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.01655},
}