ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development

Pengrui Lu*1,3,4, Shiqi Zhang*1, Yunzhong Hou*2, Lyumanshan Ye1, Chaoyi Huang1, Zixi Chen1, Ji Zeng1, Hantao Jiang1, Pengfei Liu†1,4, Yiwei Wang†3, Ming-Hsuan Yang†3
1Shanghai Jiao Tong University, 2Beijing Institute of Technology, 3UC Merced, 4Shanghai Innovation Institute
*Equal contribution, †Corresponding authors
Task Comparison

Task comparison: Unlike benchmarks where coding agents modify code snippets from pre-existing codebases, ProjDevBench evaluates end-to-end repository construction from project-level requirements.

20 Problems · 8 Categories · 6 Agents · 138 Avg. Turns · 27.38% Acceptance

Abstract

Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories.

Combining Online Judge (OJ) testing with LLM-assisted code review, the benchmark evaluates agents on (1) system architecture design, (2) functional correctness, and (3) iterative solution refinement. We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios, and evaluate six coding agents built on different LLM backends.

Benchmark Pipeline

Overview of the ProjDevBench evaluation pipeline.

Key Features

  • End-to-End Project Construction: Agents build complete repositories from scratch, not just patches or single files.
  • Dual Evaluation Protocol: Combines Online Judge execution-based testing with LLM-assisted code review for comprehensive assessment.
  • Diagnostic Feedback: Fine-grained verdict-level signals (Wrong Answer, TLE, MLE, Runtime Error, etc.) enable systematic failure analysis.
  • Multi-Agent Support: Evaluate Claude Code, Cursor, Gemini CLI, Codex, Augment, and GitHub Copilot.
  • Extended Interaction: Tasks demand sustained agent-environment interaction, averaging 138 turns and 4.81M tokens per problem.

🏆 Leaderboard

Performance on ProjDevBench across six coding agents and multiple LLM backends. Exec. represents the execution score from Online Judge, CR represents the code review score, and Final is the weighted combination (80% Exec. + 20% CR).

| Rank | Agent | Model | Easy Exec. | Easy CR | Hard Exec. | Hard CR | Overall Exec. | Overall CR | Final ↓ |
|------|-------|-------|------------|---------|------------|---------|---------------|------------|---------|
| 🥇 | Codex | GPT-5 | 79.24 | 82.11 | 69.22 | 82.90 | 76.73 | 82.31 | 77.85 |
| 🥈 | Cursor | Gemini-3-Pro-Preview | 72.87 | 88.67 | 71.47 | 80.03 | 72.52 | 86.51 | 75.32 |
| 🥉 | Augment | GPT-5 | 77.10 | 76.00 | 57.22 | 65.03 | 72.13 | 73.26 | 72.35 |
| 4 | Cursor | GPT-5 | 69.74 | 80.56 | 67.80 | 87.27 | 69.26 | 82.23 | 71.85 |
| 5 | Cursor | Sonnet-4.5 | 71.12 | 85.67 | 60.17 | 66.47 | 68.39 | 80.87 | 70.88 |
| 6 | Augment | Sonnet-4.5 | 69.14 | 92.56 | 56.81 | 67.43 | 66.06 | 86.28 | 70.10 |
| 7 | Claude Code | Sonnet-4.5 | 66.85 | 92.89 | 54.47 | 78.57 | 63.76 | 89.31 | 68.87 |
| 8 | Gemini CLI | Gemini-3-Pro-Preview | 74.57 | 80.33 | 35.53 | 94.20 | 64.81 | 83.80 | 68.61 |
| 9 | GitHub Copilot | Sonnet-4.5 | 71.10 | 87.89 | 36.63 | 80.23 | 62.48 | 85.97 | 67.18 |
| 10 | Codex | Sonnet-4.5 | 66.07 | 68.22 | 31.88 | 83.23 | 57.52 | 71.98 | 60.41 |

📊 Key Findings

  • Best Overall: Codex + GPT-5 achieves 77.85% final score, leading in execution performance.
  • Model Impact: GPT-5 generally excels at execution, while Sonnet-4.5 shows stronger code review compliance.
  • Framework Stability: Cursor and Augment demonstrate stable performance across different base models, with all configurations achieving final scores above 70%.
  • Hard vs Easy: Performance gaps widen significantly on from-scratch construction tasks (Hard problems).

📈 Submission Status Distribution

Analysis of submission outcomes across all agents reveals that only 27.38% of submissions were accepted, with the majority failing due to wrong answers (41.86%) or time limit violations (13.91%).

| Status Type | Count | Percentage |
|-------------|-------|------------|
| Accepted | 484 | 27.38% |
| Wrong Answer | 740 | 41.86% |
| Time Limit Exceeded | 246 | 13.91% |
| Runtime Error | 124 | 7.01% |
| Compile Error | 80 | 4.52% |
| Memory Leak | 62 | 3.51% |
| Memory Limit Exceeded | 24 | 1.36% |
| Others | 8 | 0.45% |

📋 Problem Details

ProjDevBench contains 20 problems across 8 categories. Easy problems provide a partial codebase (project-completion), while Hard problems require from-scratch construction (project-creation).

Category Distribution

Distribution of ProjDevBench tasks across 8 categories.

| ID | Problem Name | Category | Difficulty | Time Limit | Memory Limit | Avg Score |
|----|--------------|----------|------------|------------|--------------|-----------|
| 001 | A+B Problem | Algorithm | Easy | 1s | 256 MiB | 54.37 |
| 002 | int2048 Big Integer | Algorithm | Easy | 10s | 190 MiB | 48.19 |
| 003 | ICPC Management System | Management | Hard | 2s | 512 MiB | 52.07 |
| 004 | Bookstore System | Management | Hard | 10s | 64 MiB | 36.29 |
| 005 | QOI Format Codec | Algorithm | Easy | 10s | 512 MiB | 58.87 |
| 006 | Minesweeper | Game | Easy | 30s | 256 MiB | 53.51 |
| 007 | BASIC Interpreter | Interpreter | Easy | 5s | 256 MiB | 47.67 |
| 008 | MOV Language | Assembly | Easy | - | - | 54.70 |
| 009 | STLite Vector | Data Structure | Easy | 100s | 768 MiB | 58.46 |
| 010 | STLite List | Data Structure | Easy | 25s | 768 MiB | 30.76 |
| 011 | STLite Priority Queue | Data Structure | Easy | 15s | 512 MiB | 57.25 |
| 012 | STLite Linked HashMap | Data Structure | Easy | 24s | 893 MiB | 43.36 |
| 013 | STLite Map | Data Structure | Easy | 30s | 893 MiB | 58.21 |
| 014 | Python Interpreter | Interpreter | Easy | 16s | 512 MiB | 46.23 |
| 015 | File Storage | Storage | Hard | 16s | 6 MiB | 42.71 |
| 016 | File Storage BPT | Storage | Hard | 5s | 64 MiB | 40.11 |
| 017 | Train Ticket System | Management | Hard | 40s | 47 MiB | 53.24 |
| 018 | Scheme Interpreter | Interpreter | Easy | 1.5s | 244 MiB | 32.94 |
| 019 | GPU Memory Optimization | Optimization | Easy | 1s | 244 MiB | 36.89 |
| 020 | Buddy Algorithm | Optimization | Easy | 10s | 244 MiB | 33.33 |

Difficulty Definition: Easy = Project-completion (partial codebase provided), Hard = Project-creation (from-scratch construction)

🔬 Evaluation Methodology

ProjDevBench adopts a dual evaluation protocol that distinguishes hard functional correctness from rule-level and specification-level compliance.

Execution-based Evaluation

  • Submissions evaluated on Online Judge platform
  • Comprehensive test suites verify functional correctness
  • Fine-grained verdict signals: CE, RE, WA, TLE, MLE
  • Weighted partial credit based on test case importance

Code Review

  • Rule-based Python scripts for explicit violations
  • LLM-based review for specification compliance
  • Detects forbidden library usage and hack solutions
  • Assesses adherence to submission requirements

Final Scoring Formula

Final Score = 0.8 × Execution Score + 0.2 × Code Review Score

Prioritizes functional correctness while penalizing specification violations.

🔍 Where End-to-End Coding Agents Fail

Specification Misalignment

Agents frequently generate syntactically correct frameworks but omit critical business logic, and they fail to distinguish development contexts from submission contexts.

Edge Case Handling

Systematic weaknesses in boundary-condition handling lead to Wrong Answer and Runtime Error failures, including null pointer dereferences.
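As a hypothetical illustration of this failure mode (our own example, not code from the benchmark), the unguarded variant below invokes undefined behavior on empty input, exactly the kind of boundary miss that surfaces as a Runtime Error verdict:

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Unguarded: *max_element on an empty vector dereferences end(),
// which is undefined behavior and typically crashes at runtime.
int max_unsafe(const std::vector<int>& v) {
    return *std::max_element(v.begin(), v.end());
}

// Guarded: the empty boundary case is handled explicitly.
std::optional<int> max_safe(const std::vector<int>& v) {
    if (v.empty()) return std::nullopt;
    return *std::max_element(v.begin(), v.end());
}
```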

Time Complexity Issues

Agents favor familiar but suboptimal patterns, such as using O(log N) ordered maps where O(1) hash tables suffice, leading to TLE submissions.
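A minimal sketch of the pattern (example ours, not taken from a benchmark task): both frequency counters below are functionally identical, but the ordered map pays O(log N) per operation where the hash table's expected O(1) suffices; on large Online Judge inputs that gap can turn an otherwise correct solution into TLE:

```cpp
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Ordered map: every insert/lookup walks a balanced tree -- O(log N).
int count_ordered(const std::vector<std::string>& words, const std::string& key) {
    std::map<std::string, int> freq;
    for (const auto& w : words) ++freq[w];
    auto it = freq.find(key);
    return it == freq.end() ? 0 : it->second;
}

// Hash table: expected O(1) per insert/lookup; same answers, lower cost
// whenever key ordering is never actually needed.
int count_hashed(const std::vector<std::string>& words, const std::string& key) {
    std::unordered_map<std::string, int> freq;
    for (const auto& w : words) ++freq[w];
    auto it = freq.find(key);
    return it == freq.end() ? 0 : it->second;
}
```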

Resource Management

Agents show significant limitations in exception safety and memory management, preferring manual new/delete over RAII patterns and causing memory leaks.
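A hypothetical sketch of the contrast (example ours): the manual version must remember to free the allocation on every exit path, and leaks whenever an early return or exception skips the delete, while the RAII version releases it automatically:

```cpp
#include <memory>
#include <stdexcept>

struct Buffer { int size; };

// Manual management: every exit path needs its own delete; forgetting one
// (or letting an exception bypass it) is the pattern behind Memory Leak verdicts.
int sized_manual(int n) {
    Buffer* b = new Buffer{n};
    if (b->size < 0) {
        delete b;  // easy to forget on this early-exit path
        throw std::invalid_argument("negative size");
    }
    int s = b->size;
    delete b;
    return s;
}

// RAII: unique_ptr owns the allocation, so every exit path -- normal return
// or thrown exception -- releases it automatically.
int sized_raii(int n) {
    auto b = std::make_unique<Buffer>(Buffer{n});
    if (b->size < 0) throw std::invalid_argument("negative size");
    return b->size;
}
```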

📬 Contact

If you have any questions regarding ProjDevBench, feel free to reach out to us via email at lupengrui@sjtu.edu.cn, or directly submit a GitHub issue.

📝 Citation

If you find ProjDevBench useful for your research, please consider citing our paper:

@misc{lu2026projdevbenchbenchmarkingaicoding,
      title={ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development}, 
      author={Pengrui Lu and Shiqi Zhang and Yunzhong Hou and Lyumanshan Ye and Chaoyi Huang and Zixi Chen and Ji Zeng and Hantao Jiang and Pengfei Liu and Yiwei Wang and Ming-Hsuan Yang},
      year={2026},
      eprint={2602.01655},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.01655}, 
}