
Claude Opus 4.7 for Coding Agents: Benchmarks Breakdown

April 17, 2026 · HostAgentes Team · 8 min read

When Anthropic shipped Claude Opus 4.7 on April 16, 2026, they released three headline coding numbers:

  • 70% pass rate on CursorBench (up from 58% on Opus 4.6)
  • +13% on a 93-task coding benchmark
  • 3× more production tasks resolved on Rakuten-SWE-Bench

Those are the numbers in the press release. The question for anyone actually shipping a coding agent is: what do those numbers translate to in production? This post breaks down each benchmark, explains what it measures, and maps each to concrete behaviors you will see — or should measure — in your own Paperclip-powered coding agent.

TL;DR: The 70% CursorBench number is the single most important datapoint for agent builders. It means Opus 4.7 completes most real editor-driven agent loops without falling out of context or giving up. The 3× Rakuten number means fewer human escalations per 100 tasks. Both translate to lower cost per successful PR in real repositories.

CursorBench — the most agent-relevant number

What it measures: CursorBench is a community benchmark derived from real Cursor editor usage. Tasks are drawn from actual developer sessions: “rename this function and update all call sites,” “add a unit test for this method,” “refactor this component to use hooks.” The benchmark is closer to real coding agent workloads than any single academic suite.

Opus 4.7 score: 70%. That’s up from 58% on Opus 4.6 — a 12-point jump. For context, community estimates put GPT-4o at ~60% and Gemini 2.5 Pro at ~55% on the same benchmark (neither vendor publishes official CursorBench numbers, so these are less precise).

What this means in your agent:

  • More tasks finish without the agent giving up mid-run
  • Fewer “I couldn’t figure out how to do that” responses
  • Tighter tool call sequences (less thrashing between file reads)
  • Higher first-try success on “small but fiddly” tasks (renames, refactors, test additions)

If your coding agent currently succeeds on ~60% of dev-flow tasks, expect ~70-75% on 4.7 assuming no prompt changes. Teams willing to tune prompts for 4.7’s better instruction following have reported ~80%.

The 93-task coding benchmark — the quality ceiling

What it measures: Anthropic’s internal 93-task coding benchmark spans language coverage (Python, TypeScript, Go, Rust, Java), task types (writing new code, fixing bugs, understanding existing code, explaining diffs), and difficulty levels (junior-appropriate to senior-lead).

Opus 4.7 score: +13% over Opus 4.6. Anthropic notes the gain concentrates on the hardest tasks — meaning the tail of “agents that used to fail completely” is where you’ll see the biggest improvement.

What this means in your agent:

  • Complex multi-file refactors that 4.6 would give up on — 4.7 completes them
  • Cross-language tasks (debug this JS code by reading this Go backend) — better understanding transfer
  • Architectural-level reasoning (design a new service, pick a caching strategy) — more coherent output

If your agent does simple tasks, the 13% improvement is a nice bonus. If your agent does hard tasks, it’s transformative — the hardest 20% of tasks in your backlog may move from “usually fails” to “usually succeeds.”

Rakuten-SWE-Bench — the production realism test

What it measures: Rakuten-SWE-Bench is derived from real Rakuten engineering tickets — production bugs, feature requests, and refactors drawn from internal repositories. Tasks require reading multi-file codebases, understanding business logic, and submitting patches that pass real test suites.

Opus 4.7 score: 3× more production tasks resolved vs Opus 4.6. This is the benchmark that maps most cleanly to “is my autonomous coding agent viable?”

What this means in your agent:

  • If 4.6 closed 10 tickets per day autonomously, 4.7 closes ~30
  • Fewer “agent opened a PR but the tests fail” cases
  • Fewer escalations to a human for “why did the agent do that?”
  • More realistic economics for fully autonomous ticket-to-PR workflows
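The economics point is easy to make concrete. A quick back-of-envelope sketch, using purely illustrative numbers (the $2-per-attempt cost and resolve rates below are assumptions, not measured figures):

```python
# Back-of-envelope cost per autonomously resolved ticket.
# All inputs are illustrative assumptions, not measured numbers.

def cost_per_resolved(tickets_attempted: int, resolve_rate: float,
                      cost_per_attempt: float) -> float:
    """Total spend divided by the number of tickets actually resolved."""
    resolved = tickets_attempted * resolve_rate
    return (tickets_attempted * cost_per_attempt) / resolved

# Hypothetical: 100 tickets/day attempted at $2 per attempt.
baseline = cost_per_resolved(100, 0.10, 2.00)  # 4.6-style resolve rate
improved = cost_per_resolved(100, 0.30, 2.00)  # 3x more tickets resolved
print(f"baseline: ${baseline:.2f} per resolved ticket")
print(f"improved: ${improved:.2f} per resolved ticket")
```

With these assumptions, a 3× resolve rate cuts cost per resolved ticket by the same factor ($20.00 → $6.67) even though total spend is unchanged — which is why this one number dominates the business case.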

This is the number that will make or break the business case for an autonomous coding agent in 2026.

What drives the improvement

Anthropic’s release notes point to five underlying model improvements:

1. Better instruction following. The agent does what you ask on the first try more often. Subtle, but compounding — every prompt that doesn’t need a retry saves the full cost of the repeated call.

2. Enhanced long-context reasoning. The 1M context window doesn’t just hold more text — the model actually reasons over it more coherently. Pointer chains across files degrade less.

3. Improved file-system memory usage. For agents that read and write intermediate files during a run, 4.7 manages that state better. This matters especially for multi-hour autonomous runs.

4. Better autonomous task completion. The model is less likely to say “let me know if you want me to continue” mid-task. It just continues.

5. Higher-resolution vision. Not directly a coding benchmark, but if your coding agent ever reads screenshots of error messages, diagrams, or UI mocks, the 3.75 MP support is notable.
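The compounding effect of point 1 can be quantified with a simple model. Under a geometric retry assumption (retry until success, attempts independent — a simplification, since real retries usually carry extra context), expected calls per task is just 1 / first-try success rate:

```python
def expected_attempts(first_try_success: float) -> float:
    """Expected API calls per completed task under a simple geometric
    retry model: keep retrying until success, attempts independent."""
    return 1.0 / first_try_success

# Illustrative only: if first-try success rises from 60% to 70%,
# expected calls per task drop from ~1.67 to ~1.43 — about 14% fewer.
savings = 1 - expected_attempts(0.70) / expected_attempts(0.60)
```

Even a 10-point bump in first-try success shaves a double-digit percentage off per-task spend before any other improvement kicks in.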

What the benchmarks don’t measure

A few things to keep in mind:

Cold-start latency. Opus 4.7 with xhigh effort can take 3-8 seconds longer to first token. If your coding agent is interactive (live typing), default or high effort gives you most of the quality gain without the wait.

Agent-specific prompting. Benchmarks use clean, researcher-tuned prompts. Your production prompts may have baggage from compensating for 4.6 — try simplifying them now that 4.7 follows instructions better.

Tool ecosystem. The benchmark scores measure the model in isolation. Your agent’s overall performance depends on your tools (code search, test runner, linter). A great model with bad tools still produces mediocre results.

Real-world repo scale. CursorBench and SWE-Bench tasks are real, but they’re curated. Your production repo may have 10× the file count, 100× the history, and unique conventions the model has never seen. Expect benchmark scores to translate to directional improvements, not absolute production numbers.

Comparing Opus 4.7 to the rest of the frontier

Benchmark             Opus 4.7   Opus 4.6   GPT-4o   Gemini 2.5 Pro
CursorBench           70%        58%        ~60%     ~55%
HumanEval             ~95%       ~92%       ~93%     ~92%
SWE-Bench Verified    TBD        ~52%       ~48%     ~46%
MMLU                  ~91%       ~89%       ~88%     ~87%

GPT-4o and Gemini numbers are community-reported or estimated. Anthropic is the only vendor reporting CursorBench directly.

The gap between Opus 4.7 and everything else is widest on the agent-style benchmarks (CursorBench, Rakuten-SWE-Bench) and narrowest on the academic benchmarks (HumanEval, MMLU). That pattern tells you something: Opus 4.7 was specifically trained to be good at autonomous agent work, not to climb academic leaderboards.

How to measure 4.7 in your own agent

Don’t trust any vendor benchmark blindly. Run your own measurement over 48-72 hours:

1. Fix a frozen task set. Pick 50-100 tasks your agent has seen recently, covering the distribution of difficulty you care about. These are your “golden set.”

2. Run the set on 4.6 (baseline). Record: pass/fail, tokens used, wall-clock time, tool call count.

3. Run the set on 4.7 (treatment). Same recording.

4. Compute the delta. You care about four numbers:

  • Pass rate delta (should be +5 to +15 pp)
  • Tokens per successful task (should be -10 to -25%)
  • Wall-clock per successful task (may be ±20%)
  • Tool calls per successful task (should be -20 to -40%)

5. Decide per-agent. Some agents will benefit more than others. Migrate the clear winners first, keep collecting data on the marginal cases.
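Steps 2-4 boil down to one small computation. A minimal sketch (field names and the record shape are placeholders — adapt them to whatever your agent already logs):

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    # One task execution: the four fields steps 2-3 tell you to record.
    passed: bool
    tokens: int
    seconds: float
    tool_calls: int

def deltas(baseline: list[RunRecord], treatment: list[RunRecord]) -> dict:
    """Compute the four deltas (treatment minus baseline) from step 4."""
    def summarize(runs: list[RunRecord]) -> dict:
        ok = [r for r in runs if r.passed]
        n_ok = max(len(ok), 1)  # avoid div-by-zero if every run failed
        return {
            "pass_rate": len(ok) / len(runs),
            "tokens_per_pass": sum(r.tokens for r in ok) / n_ok,
            "seconds_per_pass": sum(r.seconds for r in ok) / n_ok,
            "tool_calls_per_pass": sum(r.tool_calls for r in ok) / n_ok,
        }
    b, t = summarize(baseline), summarize(treatment)
    return {k: t[k] - b[k] for k in b}
```

Feed it the baseline and treatment runs of your golden set and read off the four numbers directly; negative token and tool-call deltas are what you are hoping to see.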

On Paperclip, this entire loop takes a single dashboard action: clone the agent, swap the model, replay tasks, compare metrics. No infra work.

Practical recommendation

If you run a coding agent today on Opus 4.6, the path forward is:

  1. Today: Migrate one low-risk internal agent (code review, PR summarizer) to 4.7.
  2. Day 2-3: Review the four deltas above.
  3. Day 4-7: Batch-migrate production coding agents to 4.7.
  4. Week 2: Start experimenting with xhigh effort and Task budgets on the hardest agents.
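For step 1, the actual migration is often just a model-identifier swap in your agent config. A sketch — the model IDs and the "effort" field below are hypothetical placeholders, so check Anthropic's and your platform's docs for the real identifiers:

```python
# Hypothetical agent configs; model IDs and fields are placeholders.
AGENT_CONFIGS = {
    "pr-summarizer": {"model": "claude-opus-4-6", "effort": "default"},
    "ticket-agent":  {"model": "claude-opus-4-6", "effort": "high"},
}

def migrate(agent: str, new_model: str = "claude-opus-4-7") -> dict:
    """Return a copy of an agent's config pointing at the new model,
    leaving the original untouched so rollback is trivial."""
    cfg = dict(AGENT_CONFIGS[agent])
    cfg["model"] = new_model
    return cfg
```

Keeping the old config intact (rather than mutating it in place) is what makes the "migrate one low-risk agent first, roll back if the deltas look bad" plan cheap to execute.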

For agents that are not coding-focused (classification, routing, short-form generation), Opus 4.7 is usually overkill — stay on Sonnet or Haiku. The benchmarks don’t tell that story because they focus on where 4.7 is strongest.

FAQ

Does Opus 4.7 replace Opus 4.6? No — both remain available. Anthropic has committed to keeping 4.6 running through at least Q4 2026. Migrate on your own timeline.

Is 4.7 always better for coding? For complex autonomous coding, almost always. For simple completions, Sonnet or a smaller model may be more cost-effective.

How does 4.7 compare to OpenAI’s reasoning models on coding? On autonomous agent-style coding (CursorBench, SWE-Bench), 4.7 leads. On single-shot hard logic puzzles, OpenAI’s dedicated reasoning models can match or exceed — but those models are priced higher and slower.

Can I run 4.7 self-hosted? No. Opus 4.7 is API-only. If self-hosting matters, you’ll be on Llama 3.3 or open-weight Mistral variants — both of which trail Opus 4.7 significantly on agent benchmarks but give you full data control.

Where’s the full benchmark data? Anthropic’s model card at anthropic.com/claude/opus has the full numbers. Third-party benchmark aggregators (LMSYS, etc.) will likely update within the first week.


Related: Deploy Claude Opus 4.7 on Paperclip → · Migrate Opus 4.6 → 4.7 → · Opus 4.7 vs GPT-4o vs Gemini →


HostAgentes Team

Engineering & product

The HostAgentes team is part of ZUI TECHNOLOGY, S.L. — we build managed hosting for AI agents and write about the infrastructure, models and patterns we use ourselves.

About us →
