Rendered at 05:03:58 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.
noobcoder 18 minutes ago [-]
Passing tests doesn’t mean you have a working codebase.
Benchmarks that rely on a fixed test suite create a real optimization problem agents (or/and even humans) learn to satisfy the tests rather than preserve the deeper properties that make the system maintainable. AI write test cases which it thinks is easier for it to satisfy and not adhere-ing to business logic
We see this firsthand at Prismor with auto generated security fixes. Even with the best LLMs, validating fixes is the real bottleneck our pipeline struggles to exceed 70% on an internal golden dataset (which itself is somewhat biased).
Many patches technically fix the vulnerability but introduce semantic regressions or architectural drift. Passing tests is a weak signal and proving a fix is truly safe to merge is much harder
devonkelley 2 hours ago [-]
25% regression rate on the best model is the number people should be sitting with here. That means 1 in 4 commits from your agent is breaking something that used to work. In any human team that would get you a serious conversation. We keep benchmarking agents like they're taking a test but the actual failure mode in production is slow accumulation of regressions nobody catches until the whole thing is on fire.
agent5ravi 12 hours ago [-]
The resolve rate numbers are interesting but I keep coming back to the regression question. In my experience doing code review on a real codebase, the hard part of maintenance is not fixing the thing that broke. It is understanding whether your fix preserves the invariants the original author had in mind but did not write down.
A benchmark that checks CI pass/fail captures the first part. It cannot capture the second. An agent that makes CI green by weakening an assertion or bypassing a check will score well here but create a time bomb.
The monorepo point from yuyuqueen hits this. When the agent can see the full dependency graph, it is less likely to fix something locally while breaking a downstream assumption. The biggest maintenance failures I have seen are not wrong logic. They are fixes that are locally correct but violate an unwritten contract between components.
rekornode 8 hours ago [-]
CI pass/fail captures regression, but there's a layer beneath it that benchmarks can't touch: what exactly did the agent submit to each external API, and can you prove it after the fact?
In the benchmark context this doesn't matter everything runs locally. In production it does. The agent calls a third-party service at 2am, the service claims it returned an error, your agent retried and billed you twice. Your logs say one thing, their logs say another.
The integrity problem isn't just "did the code work" it's "what was the exact request/response pair, timestamped, by whom, provably." CI solves the first. Something else has to solve the second.
westurner 12 hours ago [-]
> It is understanding whether your fix preserves the invariants the original author had in mind but did not write down.
This may also be the limit to the quality of an automated port to another language. What isn't encoded as automated tests or manual test procedure cannot be verified.
So often I'm amazed at what it's possible to accomplish from a prompt that's certainly insufficient with insufficient context. "It should have been necessary to specify more context there," or "I would have thought that it wasn't possible to do that without reading in more context than just one source code file," and then a few prompts later, "there's where we failed for trying to skimp on context"
To prevent architectural rework as a human developer also requires substantial ahead-of-time codebase review.
Are AGENTS.md files the best place to summarize more comprehensive codebase review and useful dense context like guidelines for testing and architectural components in order to avoid rework?
oliver_dr 8 hours ago [-]
[dead]
mentalgear 18 hours ago [-]
Claude wins by a large margin
* Claude Opus 4.6 : 0.71
* Claude Opus 4.5 : 0.51
* KIMI-K2.5 : 0.37
* GLM-5 : 0.36
* GPT-5.2 : 0.23
Note: later GPT versions seem to be only available within openAi's proprietary codex cli, so can't be tested - and if tested via the codex cli "harness" it wouldn't be a pure model-to-model comparison any more.
---
Of course, the interesting follow-up question is: How well perform these models with added agent tooling ("harness") ?
Maybe someone has tokens to burn and can run a matrix of agent tools over the top models and provide the results?
mike_hearn 16 hours ago [-]
It's the other way around - Claude Code is the proprietary one. Codex CLI is open source:
You can definitely access the latest models via the API. That's how Codex CLI works.
pizlonator 13 hours ago [-]
gpt-5.3 was not accessible via API, at least for me
But it was in codex
andai 14 hours ago [-]
>if tested via the codex cli "harness" it wouldn't be a pure model-to-model comparison any more.
Well that's already not a very fair comparison, we've known for years (one of the early-ish LLM papers, maybe someone knows which one) that prompting makes an enormous difference on agent performance, and most strikingly, the same prompt that massively boosts performance on one model, can massively reduce performance on another.
So you already need to fine-tune the prompts for the model, if you want anything approaching best results.
Now what's really amusing is that if you run models without their official harness, they can actually do way better on some benchmarks! [0] e.g. On Terminal Bench 2, Claude Opus 4.6 goes from #33 (Claude Code) to #5 (custom harness). Similar results for Codex.
Now, this is "for this one very specific benchmark", but I still thought it was funny, since you'd expect "the harness made by the same company" to be the best for all tasks, but that's clearly not the case. (For specific tasks, it's actually quite trivial to outperform a general purpose harness.)
I reached the same conclusion. I tried using both for my personal investment ambient using agent-pair programming to build and agentic intelligence layer for stocks and the difference between the 2 models if astounding.
pizlonator 13 hours ago [-]
> and if tested via the codex cli "harness" it wouldn't be a pure model-to-model comparison any more.
But the interesting comparison when evaluating coding agent capabilities is to evaluate the offerings given to users.
So this means comparing Claude Code to Codex to whatever CLI tools Kimi, GLM, and others give you.
And it might mean throwing Cursor, OpenCode, Amp, Pi, mini-swe-agent, etc into the mix
climike 13 hours ago [-]
We are working on supporting agent harnesses @ www.cliwatch.com, so both 1. LLM model as well 2. LLM model + harness performance can be evaluated against your software/CLI. We also support building evals against your doc suite. End result is that you’ll feel more comfortable shipping CLIs that work for your agentic users!:)
gizmodo59 16 hours ago [-]
Unfortunately the paper doesn’t include gpt 5.3 which was released around the same time as opus 4.6 and also gpt 5.4 few days back. Both are available via api
IMHO The harness must be used when running these experiments. The model vendors know best on giving the best harness with gpt 5.4 and codex or Claude code with opus 4.6 which makes a big difference if you are running any kind of agentic coding tasks.
I see both Claude and gpt to be neck and neck in coding. Every other model+harness is definitely 3-6 months behind. Right now codex seems to be the best in terms of solving complex bugs, long running tasks, much higher limits and even speed while Claude seems to do well in front end and their cli ux seems nice! Codex app is very good though (wish it wasn’t electron as a memory hog but it’s good)
jasonjmcghee 13 hours ago [-]
> model vendors know best on giving the best harness
This was only true for Claude Code for a while. Codex was poor and Gemini was unusable.
Since then Codex has gotten quite good.
jsemrau 1 hours ago [-]
It still fubars my code regularly at 11x the price.
Github Copilot Agentic Mode + Sonnet 4.6 is stable and inexpensive.
p1esk 16 hours ago [-]
Are you saying they did not use native harnesses like Claude Code or Codex? How did they do it then?
50lo 18 hours ago [-]
It’d be interesting to see this compared against a human baseline — e.g., a competent engineer with a fixed time budget on the same tasks.
KronisLV 19 hours ago [-]
> The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository.
This seems like a really cool thing to benchmark! Technically it'd be possible to take GitHub repos that the AI orgs probably already have, cross-reference the code against the issues and regressions, and train/validate on that.
"Vibe coded stuff gets hard to maintain and will end up buggy." Yeah, so make models that deal with that better, optimize for maintainability and consistency.
Cool to see Claude doing decently though!
woadwarrior01 18 hours ago [-]
> Cool to see Claude doing decently though!
The scales do seem to be tipped in its favor (cf: my other comment in this thread).
baalimago 16 hours ago [-]
Replace "Agent" with "Employee" and apply the same algorithm. Evaluate employee efficiency. Profit?
KronisLV 16 hours ago [-]
I'd unironically (and privately) want to do that with the code of both myself and those around me - to maybe see who I should listen more to, as well as who maybe less (ideally down to the feature level), because everyone has opinions, sometimes loud ones, but some approaches lead to a lot of churn and issues over the years.
yuyuqueen 12 hours ago [-]
The regression rates match what I saw early on with Claude Code on my monorepo. The fix was structural, not model-level: keeping everything in a single tree (packages, tests, docs, CI config) so the agent sees downstream effects of any change. When context is split across repos, agents cheerfully break imports because they literally can't see what depends on what.
Something hard to capture in benchmarks: project-level conventions. A well-maintained CLAUDE.md at the repo root — describing architecture, naming patterns, test conventions — gives the agent context it internalizes before touching code. My regression rate dropped noticeably once I started maintaining that kind of project metadata. Model choice is only half the equation — the other half is how well you've structured the information environment the agent works in.
smy20011 12 hours ago [-]
It interesting to see that the eval set becoming more and more expensive. Previously we just need to evaluate one test set, right now we need to create a lot of diffs and run a lot of tests.
rurban 9 hours ago [-]
The zero regression rate graph at the end is exactly my experience. Only Opus is useful right now, the rest are juniors.
jbergqvist 15 hours ago [-]
Would have loved to see a more detailed breakdown of performance by task type. The commit metadata is right there, seems straightforward to tag commits as feature vs refactor vs bug fix vs API change and report per-category numbers.
challengerVIE 20 hours ago [-]
To me using agents daily, the long term vision with maintainability in mind really makes the difference between us humans and agents, I like the idea. However evaluating long term maintainability over an average of just 500 loc changes does not sound like long term maintainability being measured here
woadwarrior01 18 hours ago [-]
Interesting benchmark.
I can't help but notice that they're benchmarking Opus 4.6 (Anthropic's latest and greatest model) against GPT-5.2 (which is three generations behind OpenAI's latest coding models: GPT-5.2-Codex, GPT-5.3-Codex and the latest GPT-5.4).
aurareturn 18 hours ago [-]
As far as I know, OpenAI did not release 5.3 Codex in their API. You can only use it with Codex CLI or app.
baalimago 16 hours ago [-]
It's there, you just need to use it with the responses API. Set model field to 'gpt-5.3-codex'
re-thc 17 hours ago [-]
5.2 and 5.2 Codex is arguably the same gen.
jasonjmcghee 13 hours ago [-]
Sure, but one is fine-tuned for what they are testing and one is not.
PunchyHamster 17 hours ago [-]
I'm sure with benchmarks like these future LLMs will be optimized to hide regressions by "fixing" test framework too
pixl97 14 hours ago [-]
Isn't misalignment great.
qsera 15 hours ago [-]
>Alibaba Group
verdverm 21 hours ago [-]
Really long-term task benchmark showing significant improvements in very recent models, while also showing really bad regression rates across the board.
woeirua 14 hours ago [-]
Uh, Opus 4.6 avoids introducing regressions 75% of the time?
verdverm 13 hours ago [-]
So 1/4 times it does not introduce a regression. That's still pretty bad imo. If 1/4 commits introduced regressions, what would your team do?
We are talking about regressions, what once worked no longer does, and should be measured in 9s
notduncansmith 12 hours ago [-]
You overestimate many teams I think.
entrustai 15 hours ago [-]
[dead]
raphaelmolly8 12 hours ago [-]
[dead]
devcraft_ai 21 hours ago [-]
[dead]
coder_decoder 14 hours ago [-]
[flagged]
jlebensold 14 hours ago [-]
I've been building a similar loop with jetty.io for the last few months exclusively focused on data science workflows. I think that there's a lot of hill-climbing that can be accomplished by having a clear runbook.
calvinmorrison 14 hours ago [-]
I've been.... and they genuinely... And honestly? the real x is that. it went from X to Y.
We see this firsthand at Prismor with auto generated security fixes. Even with the best LLMs, validating fixes is the real bottleneck our pipeline struggles to exceed 70% on an internal golden dataset (which itself is somewhat biased).
Many patches technically fix the vulnerability but introduce semantic regressions or architectural drift. Passing tests is a weak signal and proving a fix is truly safe to merge is much harder
A benchmark that checks CI pass/fail captures the first part. It cannot capture the second. An agent that makes CI green by weakening an assertion or bypassing a check will score well here but create a time bomb.
The monorepo point from yuyuqueen hits this. When the agent can see the full dependency graph, it is less likely to fix something locally while breaking a downstream assumption. The biggest maintenance failures I have seen are not wrong logic. They are fixes that are locally correct but violate an unwritten contract between components.
This may also be the limit to the quality of an automated port to another language. What isn't encoded as automated tests or manual test procedure cannot be verified.
So often I'm amazed at what it's possible to accomplish from a prompt that's certainly insufficient with insufficient context. "It should have been necessary to specify more context there," or "I would have thought that it wasn't possible to do that without reading in more context than just one source code file," and then a few prompts later, "there's where we failed for trying to skimp on context"
To prevent architectural rework as a human developer also requires substantial ahead-of-time codebase review.
Are AGENTS.md files the best place to summarize more comprehensive codebase review and useful dense context like guidelines for testing and architectural components in order to avoid rework?
* Claude Opus 4.6 : 0.71
* Claude Opus 4.5 : 0.51
* KIMI-K2.5 : 0.37
* GLM-5 : 0.36
* GPT-5.2 : 0.23
Note: later GPT versions seem to be only available within openAi's proprietary codex cli, so can't be tested - and if tested via the codex cli "harness" it wouldn't be a pure model-to-model comparison any more.
---
Of course, the interesting follow-up question is: How well perform these models with added agent tooling ("harness") ?
Maybe someone has tokens to burn and can run a matrix of agent tools over the top models and provide the results?
https://github.com/openai/codex
You can definitely access the latest models via the API. That's how Codex CLI works.
But it was in codex
Well that's already not a very fair comparison, we've known for years (one of the early-ish LLM papers, maybe someone knows which one) that prompting makes an enormous difference on agent performance, and most strikingly, the same prompt that massively boosts performance on one model, can massively reduce performance on another.
So you already need to fine-tune the prompts for the model, if you want anything approaching best results.
Now what's really amusing is that if you run models without their official harness, they can actually do way better on some benchmarks! [0] e.g. On Terminal Bench 2, Claude Opus 4.6 goes from #33 (Claude Code) to #5 (custom harness). Similar results for Codex.
Now, this is "for this one very specific benchmark", but I still thought it was funny, since you'd expect "the harness made by the same company" to be the best for all tasks, but that's clearly not the case. (For specific tasks, it's actually quite trivial to outperform a general purpose harness.)
[0] https://www.tbench.ai/leaderboard/terminal-bench/2.0
But the interesting comparison when evaluating coding agent capabilities is to evaluate the offerings given to users.
So this means comparing Claude Code to Codex to whatever CLI tools Kimi, GLM, and others give you.
And it might mean throwing Cursor, OpenCode, Amp, Pi, mini-swe-agent, etc into the mix
https://developers.openai.com/api/docs/models/gpt-5.3-codex
IMHO The harness must be used when running these experiments. The model vendors know best on giving the best harness with gpt 5.4 and codex or Claude code with opus 4.6 which makes a big difference if you are running any kind of agentic coding tasks.
I see both Claude and gpt to be neck and neck in coding. Every other model+harness is definitely 3-6 months behind. Right now codex seems to be the best in terms of solving complex bugs, long running tasks, much higher limits and even speed while Claude seems to do well in front end and their cli ux seems nice! Codex app is very good though (wish it wasn’t electron as a memory hog but it’s good)
This was only true for Claude Code for a while. Codex was poor and Gemini was unusable.
Since then Codex has gotten quite good.
This seems like a really cool thing to benchmark! Technically it'd be possible to take GitHub repos that the AI orgs probably already have, cross-reference the code against the issues and regressions, and train/validate on that.
The dataset would need to be way bigger to get close to the likes of SWE-bench: https://www.swebench.com/original.html
"Vibe coded stuff gets hard to maintain and will end up buggy." Yeah, so make models that deal with that better, optimize for maintainability and consistency.
Cool to see Claude doing decently though!
The scales do seem to be tipped in its favor (cf: my other comment in this thread).
Something hard to capture in benchmarks: project-level conventions. A well-maintained CLAUDE.md at the repo root — describing architecture, naming patterns, test conventions — gives the agent context it internalizes before touching code. My regression rate dropped noticeably once I started maintaining that kind of project metadata. Model choice is only half the equation — the other half is how well you've structured the information environment the agent works in.
I can't help but notice that they're benchmarking Opus 4.6 (Anthropic's latest and greatest model) against GPT-5.2 (which is three generations behind OpenAI's latest coding models: GPT-5.2-Codex, GPT-5.3-Codex and the latest GPT-5.4).
We are talking about regressions, what once worked no longer does, and should be measured in 9s
dang permaban this AI slop please