ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development

Pengrui Lu*1,3,4, Shiqi Zhang*1, Yunzhong Hou*2, Lyumanshan Ye1, Chaoyi Huang1, Zixi Chen1, Ji Zeng1, Hantao Jiang1, Pengfei Liu†1,4, Yiwei Wang†3, Ming-Hsuan Yang†3
1Shanghai Jiao Tong University, 2Beijing Institute of Technology, 3UC Merced, 4Shanghai Innovation Institute
*Equal contribution, †Corresponding authors
Task Comparison

Task comparison: Unlike benchmarks where coding agents modify code snippets from pre-existing codebases, ProjDevBench evaluates end-to-end repository construction from project-level requirements.

20 Problems · 8 Categories · 6 Agents · 138 Avg. Turns · 27.38% Acceptance

Abstract

Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories.

Combining Online Judge (OJ) testing with LLM-assisted code review, the benchmark evaluates agents on (1) system architecture design, (2) functional correctness, and (3) iterative solution refinement. We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios, and evaluate six coding agents built on different LLM backends.

Benchmark Pipeline

Overview of the ProjDevBench evaluation pipeline.

Key Features

  • End-to-End Project Construction: Agents build complete repositories from scratch, not just patches or single files.
  • Dual Evaluation Protocol: Combines Online Judge execution-based testing with LLM-assisted code review for comprehensive assessment.
  • Diagnostic Feedback: Fine-grained verdict-level signals (Wrong Answer, TLE, MLE, Runtime Error, etc.) enable systematic failure analysis.
  • Multi-Agent Support: Evaluate Claude Code, Cursor, Gemini CLI, Codex, Augment, and GitHub Copilot.
  • Extended Interaction: Tasks demand sustained agent-environment interaction, averaging 138 turns and 4.81M tokens per problem.

🏆 Leaderboard

Performance on ProjDevBench across six coding agents and multiple LLM backends. Exec. represents the execution score from Online Judge, CR represents the code review score, and Final is the weighted combination (80% Exec. + 20% CR).

| Rank | Agent | Model | Easy Exec. | Easy CR | Hard Exec. | Hard CR | Overall Exec. | Overall CR | Final ↓ |
|------|-------|-------|------------|---------|------------|---------|---------------|------------|---------|
| 🥇 | Codex | GPT-5 | 79.24 | 82.11 | 69.22 | 82.90 | 76.73 | 82.31 | 77.85 |
| 🥈 | Cursor | Gemini-3-Pro-Preview | 72.87 | 88.67 | 71.47 | 80.03 | 72.52 | 86.51 | 75.32 |
| 🥉 | Augment | GPT-5 | 77.10 | 76.00 | 57.22 | 65.03 | 72.13 | 73.26 | 72.35 |
| 4 | Cursor | GPT-5 | 69.74 | 80.56 | 67.80 | 87.27 | 69.26 | 82.23 | 71.85 |
| 5 | Cursor | Sonnet-4.5 | 71.12 | 85.67 | 60.17 | 66.47 | 68.39 | 80.87 | 70.88 |
| 6 | Augment | Sonnet-4.5 | 69.14 | 92.56 | 56.81 | 67.43 | 66.06 | 86.28 | 70.10 |
| 7 | Claude Code | Sonnet-4.5 | 66.85 | 92.89 | 54.47 | 78.57 | 63.76 | 89.31 | 68.87 |
| 8 | Gemini CLI | Gemini-3-Pro-Preview | 74.57 | 80.33 | 35.53 | 94.20 | 64.81 | 83.80 | 68.61 |
| 9 | GitHub Copilot | Sonnet-4.5 | 71.10 | 87.89 | 36.63 | 80.23 | 62.48 | 85.97 | 67.18 |
| 10 | Codex | Sonnet-4.5 | 66.07 | 68.22 | 31.88 | 83.23 | 57.52 | 71.98 | 60.41 |

📊 Key Findings

  • Best Overall: Codex + GPT-5 achieves 77.85% final score, leading in execution performance.
  • Model Impact: GPT-5 generally excels at execution, while Sonnet-4.5 shows stronger code review compliance.
  • Framework Stability: Cursor and Augment demonstrate stable performance across different base models, with all configurations achieving final scores above 70%.
  • Hard vs Easy: Performance gaps widen significantly on from-scratch construction tasks (Hard problems).

📈 Submission Status Distribution

Analysis of submission outcomes across all agents reveals that only 27.38% of submissions were accepted, with the majority failing due to wrong answers (41.86%) or time limit violations (13.91%).

| Status Type | Count | Percentage |
|-------------|-------|------------|
| Accepted | 484 | 27.38% |
| Wrong Answer | 740 | 41.86% |
| Time Limit Exceeded | 246 | 13.91% |
| Runtime Error | 124 | 7.01% |
| Compile Error | 80 | 4.52% |
| Memory Leak | 62 | 3.51% |
| Memory Limit Exceeded | 24 | 1.36% |
| Others | 8 | 0.45% |

📋 Problem Details

ProjDevBench contains 20 problems across 8 categories. Easy problems provide a partial codebase (project-completion), while Hard problems require from-scratch construction (project-creation).

Category Distribution

Distribution of ProjDevBench tasks across 8 categories.

| ID | Problem Name | Category | Difficulty | Time Limit | Memory Limit | Avg Score |
|----|--------------|----------|------------|------------|--------------|-----------|
| 001 | A+B Problem | Algorithm | Easy | 1s | 256 MiB | 54.37 |
| 002 | int2048 Big Integer | Algorithm | Easy | 10s | 190 MiB | 48.19 |
| 003 | ICPC Management System | Management | Hard | 2s | 512 MiB | 52.07 |
| 004 | Bookstore System | Management | Hard | 10s | 64 MiB | 36.29 |
| 005 | QOI Format Codec | Algorithm | Easy | 10s | 512 MiB | 58.87 |
| 006 | Minesweeper | Game | Easy | 30s | 256 MiB | 53.51 |
| 007 | BASIC Interpreter | Interpreter | Easy | 5s | 256 MiB | 47.67 |
| 008 | MOV Language | Assembly | Easy | - | - | 54.70 |
| 009 | STLite Vector | Data Structure | Easy | 100s | 768 MiB | 58.46 |
| 010 | STLite List | Data Structure | Easy | 25s | 768 MiB | 30.76 |
| 011 | STLite Priority Queue | Data Structure | Easy | 15s | 512 MiB | 57.25 |
| 012 | STLite Linked HashMap | Data Structure | Easy | 24s | 893 MiB | 43.36 |
| 013 | STLite Map | Data Structure | Easy | 30s | 893 MiB | 58.21 |
| 014 | Python Interpreter | Interpreter | Easy | 16s | 512 MiB | 46.23 |
| 015 | File Storage | Storage | Hard | 16s | 6 MiB | 42.71 |
| 016 | File Storage BPT | Storage | Hard | 5s | 64 MiB | 40.11 |
| 017 | Train Ticket System | Management | Hard | 40s | 47 MiB | 53.24 |
| 018 | Scheme Interpreter | Interpreter | Easy | 1.5s | 244 MiB | 32.94 |
| 019 | GPU Memory Optimization | Optimization | Easy | 1s | 244 MiB | 36.89 |
| 020 | Buddy Algorithm | Optimization | Easy | 10s | 244 MiB | 33.33 |

Difficulty Definition: Easy = Project-completion (partial codebase provided), Hard = Project-creation (from-scratch construction)

🔬 Evaluation Methodology

ProjDevBench adopts a dual evaluation protocol that distinguishes hard functional correctness from rule-level and specification-level compliance.

Execution-based Evaluation

  • Submissions evaluated on Online Judge platform
  • Comprehensive test suites verify functional correctness
  • Fine-grained verdict signals: CE, RE, WA, TLE, MLE
  • Weighted partial credit based on test case importance

Code Review

  • Rule-based Python scripts for explicit violations
  • LLM-based review for specification compliance
  • Detects forbidden library usage and hack solutions
  • Assesses adherence to submission requirements

Final Scoring Formula

Final Score = 0.8 × Execution Score + 0.2 × Code Review Score

Prioritizes functional correctness while penalizing specification violations.

🔍 Where End-to-End Coding Agents Fail

Specification Misalignment

Agents frequently generate syntactically correct frameworks but omit critical business logic, and they fail to distinguish development contexts from submission contexts.

Edge Case Handling

Systematic weaknesses in boundary-condition handling lead to Wrong Answer and Runtime Error failures, including null pointer dereferences.
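As a hypothetical illustration of this failure mode (our own example, not code from the benchmark), the unguarded variant below invokes undefined behavior on empty input, exactly the kind of boundary miss that surfaces as a Runtime Error verdict:

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Unguarded: *max_element on an empty vector dereferences end(),
// which is undefined behavior and typically crashes at runtime.
int max_unsafe(const std::vector<int>& v) {
    return *std::max_element(v.begin(), v.end());
}

// Guarded: the empty boundary case is handled explicitly.
std::optional<int> max_safe(const std::vector<int>& v) {
    if (v.empty()) return std::nullopt;
    return *std::max_element(v.begin(), v.end());
}
```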

Time Complexity Issues

Agents favor familiar but suboptimal patterns, such as using O(log N) ordered maps where O(1) hash tables suffice, leading to TLE submissions.
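A minimal sketch of the pattern (example ours, not taken from a benchmark task): both frequency counters below are functionally identical, but the ordered map pays O(log N) per operation where the hash table's expected O(1) suffices; on large Online Judge inputs that gap can turn an otherwise correct solution into TLE:

```cpp
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Ordered map: every insert/lookup walks a balanced tree -- O(log N).
int count_ordered(const std::vector<std::string>& words, const std::string& key) {
    std::map<std::string, int> freq;
    for (const auto& w : words) ++freq[w];
    auto it = freq.find(key);
    return it == freq.end() ? 0 : it->second;
}

// Hash table: expected O(1) per insert/lookup; same answers, lower cost
// whenever key ordering is never actually needed.
int count_hashed(const std::vector<std::string>& words, const std::string& key) {
    std::unordered_map<std::string, int> freq;
    for (const auto& w : words) ++freq[w];
    auto it = freq.find(key);
    return it == freq.end() ? 0 : it->second;
}
```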

Resource Management

Agents show significant limitations in exception safety and memory management, preferring manual new/delete over RAII patterns and causing memory leaks.
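A hypothetical sketch of the contrast (example ours): the manual version must remember to free the allocation on every exit path, and leaks whenever an early return or exception skips the delete, while the RAII version releases it automatically:

```cpp
#include <memory>
#include <stdexcept>

struct Buffer { int size; };

// Manual management: every exit path needs its own delete; forgetting one
// (or letting an exception bypass it) is the pattern behind Memory Leak verdicts.
int sized_manual(int n) {
    Buffer* b = new Buffer{n};
    if (b->size < 0) {
        delete b;  // easy to forget on this early-exit path
        throw std::invalid_argument("negative size");
    }
    int s = b->size;
    delete b;
    return s;
}

// RAII: unique_ptr owns the allocation, so every exit path -- normal return
// or thrown exception -- releases it automatically.
int sized_raii(int n) {
    auto b = std::make_unique<Buffer>(Buffer{n});
    if (b->size < 0) throw std::invalid_argument("negative size");
    return b->size;
}
```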

📬 Contact

If you have any questions regarding ProjDevBench, feel free to reach out to us via email at lupengrui@sjtu.edu.cn, or directly submit a GitHub issue.

📝 Citation

If you find ProjDevBench useful for your research, please consider citing our paper:

@misc{lu2026projdevbenchbenchmarkingaicoding,
      title={ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development}, 
      author={Pengrui Lu and Shiqi Zhang and Yunzhong Hou and Lyumanshan Ye and Chaoyi Huang and Zixi Chen and Ji Zeng and Hantao Jiang and Pengfei Liu and Yiwei Wang and Ming-Hsuan Yang},
      year={2026},
      eprint={2602.01655},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.01655}, 
}