CAPABILITY WATCH / agentic AI
Updated June 2026
What can agents actually do now

Agentic capabilities, June 2026

At the Manus moment in early 2025, the open question was whether an agent could do the task at all. Eighteen months on, it mostly can — the coding ceiling rose a full generation, computer use reached the human band, and a single model can now run for half a day at coinflip odds. The live questions have moved downstream: reliability, and whether an agent can own a job end to end.

69.2%: SWE-bench Pro (real GitHub issues, no answer leakage)
83.4%: OSWorld computer use (inside the human range)
~12 h: Autonomy horizon (but only at 50% reliability)
2.5%: Real paid work done end-to-end (Remote Labor Index)

The coding ceiling rose a generation

Higher is better · SWE-bench Pro — resolving real, post-cutoff GitHub issues across a full repo

Claude Opus 4.8: 69.2%
GLM-5.2 (open weights): 62.1%
GPT-5.5: 58.6%
Kimi K2.6: 58.6%

Read the harder number. SWE-bench Verified is saturating: Opus 4.8 scores ~88.6% and the field clusters near 88–95%, so it no longer separates models. Pro draws from actively maintained repos created after training cutoff, where multi-file diffs require real architectural understanding. The gap there is the honest signal, and open weights (GLM-5.2, 62.1%) now trail the top closed model by about 7 points while edging past GPT-5.5. One asterisk on all of it: the splashier long-horizon scores tend to be self-reported by the same lab that shipped the model (the eternal "trust me bro" benchmarks), worth precisely your trust in the source until someone independent reruns them. The Pro figures above already survived that; treat the rest as marketing with a methodology section.

Computer use crossed into the human band

Higher is better · OSWorld-Verified — driving a real desktop with mouse and keyboard

Claude Opus 4.8: 83.4%
GPT-5.5: 78.7%
Gemini 3.1 Pro: 76.2%

Human baseline range: 72–84%, depending on task category. Top models now land inside it.

Six months ago the leaders were GPT-5.4 at 75.0% and Opus 4.6 at 72.7%, barely at the floor of the human band. The frontier moved ~8 points in two model generations. Caveat: ~1 in 6 desktop tasks still fails, and scores are harness-dependent.

The autonomy split is the whole story

METR task-completion time horizon: the length of task, measured by how long it takes a skilled human, that a model completes at a given success rate. Note the log scale.

80% reliability (dependable): ~27 min
50% reliability (coinflip): ~12 h
>16 h: beyond the ruler
[Chart: horizontal log-scale time axis with ticks at 1 min, 10 min, 30 min, 1 h, 4 h, 12 h and 24 h]

Same model, two answers

The 50% horizon (Opus 4.6) reached ~12 hours; Claude Mythos Preview ran past METR's 16-hour measurement ceiling. But the 80% horizon — where you'd actually leave it unattended — is still ~27 minutes. Hours at a coinflip; half an hour you can trust.
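To see how hours and minutes can describe the same model, here is a minimal sketch. It assumes success probability falls off as a logistic curve in the log of task length. That curve shape and the function names are assumptions of this illustration, not METR's fitting code. The only inputs are the two horizons quoted above.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def implied_slope(h50_min: float, h80_min: float) -> float:
    """Scale s of the ASSUMED curve: logit(p) = (ln h50 - ln t) / s."""
    return math.log(h50_min / h80_min) / logit(0.80)

def success_probability(task_min: float, h50_min: float, s: float) -> float:
    """Estimated success rate on a task that takes a human `task_min` minutes."""
    z = (math.log(h50_min) - math.log(task_min)) / s
    return 1.0 / (1.0 + math.exp(-z))

H50_MIN = 12 * 60  # ~12 h at 50% reliability, from the text
H80_MIN = 27       # ~27 min at 80% reliability, from the text

s = implied_slope(H50_MIN, H80_MIN)
for minutes in (27, 60, 240, 720):
    p = success_probability(minutes, H50_MIN, s)
    print(f"{minutes:>4} min task: {p:.0%} estimated success")
```

By construction the curve passes through 80% at 27 minutes and 50% at 12 hours. Under this assumption a one-hour task sits somewhere between the two, which is why neither headline number alone describes what the model can be trusted with.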

And accelerating

The post-2023 doubling time is now ~4 months, down from the original ~7. The catch: confidence intervals are enormous (the ~12 h estimate spans roughly 5–65 h) because the benchmark is running out of tasks long enough to measure the frontier.
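The doubling time can be turned into a rough planning number. The sketch below assumes the 80% horizon doubles at the same pace as the headline trend, which the source does not state. The 8-hour target and the function name are hypothetical choices for illustration.

```python
import math

def months_to_reach(current: float, target: float, doubling_months: float) -> float:
    """Months for a quantity that doubles every `doubling_months` to grow from current to target."""
    if current <= 0 or target <= 0 or doubling_months <= 0:
        raise ValueError("all arguments must be positive")
    return doubling_months * math.log2(target / current)

H80_NOW_MIN = 27      # ~27 min 80% horizon, from the text
WORKDAY_MIN = 8 * 60  # hypothetical target: a full working day at 80% reliability

for doubling in (4, 7):  # ~4 months now, ~7 originally (from the text)
    months = months_to_reach(H80_NOW_MIN, WORKDAY_MIN, doubling)
    print(f"doubling every {doubling} months: ~{months:.0f} months to an 8 h horizon at 80%")
```

Going from 27 minutes to 8 hours takes a little over four doublings, so roughly 17 months at the current pace and roughly 29 at the original one. Given the confidence intervals described above, treat either figure as an order-of-magnitude guide rather than a forecast.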

End to end, on real paid work: still the floor

Remote Labor Index (Scale AI / CAIS) — can an agent be the freelancer and finish the whole brief, not assist with parts of it?

2.5% of 240 real, end-to-end freelance projects were completed to a standard a reasonable client would accept, by the single best agent. Every model tested came in under 3%. The benchmarks above measure isolated skills; this measures owning the deliverable.

$143,991: earned by the human freelancers who did this work
$1,720: value the top agent (Manus) completed acceptably

Manus: 2.5%
Grok-4 / Sonnet 4.5: 2.1%
GPT-5: 1.7%
ChatGPT agent: 1.3%
Gemini 2.5 Pro: 0.8%

Bars are on the full 0–100% scale, with human = 100% (an accepted deliverable). The slivers are the point. v1 figures (Oct 2025); the leaderboard is live, but the floor still stands through mid-2026. Models are creeping up on pairwise Elo, not yet on completion rate.
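The headline percentages are easier to weigh as counts and dollars. This is a quick arithmetic sketch using only the figures quoted above. The variable names are illustrative, and because 2.5% is a rounded rate, the project count is approximate.

```python
TOTAL_PROJECTS = 240          # from the text
BEST_COMPLETION_RATE = 0.025  # Manus, from the text (rounded)
HUMAN_EARNINGS_USD = 143_991  # paid to the human freelancers, from the text
AGENT_VALUE_USD = 1_720       # value the top agent completed acceptably, from the text

projects_completed = round(TOTAL_PROJECTS * BEST_COMPLETION_RATE)
value_share = AGENT_VALUE_USD / HUMAN_EARNINGS_USD

print(f"projects accepted: about {projects_completed} of {TOTAL_PROJECTS}")
print(f"share of dollar value: {value_share:.1%}")
```

The best agent's accepted work comes to about six projects. Its share of the dollar value is lower than its share of projects. If both figures cover the same 240 projects, that suggests the briefs it finished were cheaper than average.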

A new wrinkle: access to the frontier

For the first time, the most capable tier wasn't gated by capability — it was switched off by directive.

Jun 9, 2026: Fable 5 & Mythos 5 launch. Anthropic's most capable public models. Mythos Preview had already run past METR's 16 h ceiling.
Jun 12, 2026: US Commerce export directive. All access suspended for any foreign national, worldwide, citing a narrow jailbreak. First export control aimed at a model, not chips.
Jun 12, 5:21pm ET: Disabled globally within hours. Anthropic couldn't verify nationality in real time, so it pulled both for everyone. It disputes the order; litigation ongoing.
Public lifespan: 3 days.

Noted for the record, not the trend. The ban is a bummer and the precedent matters, but the capability it gated won't stay scarce. Open weights are already within about 7 points of the closed frontier on SWE-bench Pro (GLM-5.2 vs Opus 4.8), and other labs ship monthly. The interesting variable for planning isn't this one model's availability; it's what autonomous work is possible at all, which is what every panel above is really measuring.

Claimed in early 2025 → possible now

The Manus-era pitch, scored against June 2026 reality.

| Domain | Claimed at the Manus moment | Possible now (Jun 2026) | Verdict |
|---|---|---|---|
| Autonomous coding | AI engineers ship features end-to-end | Scoped, multi-file work with review. SWE-Pro ~69% (not the saturated ~88–95% Verified figure). Still supervised on novel work. | Partial ↑ |
| Long-running autonomy | Agents run for hours, unattended | 50% horizon ~12 h (Mythos past the 16 h ceiling), but 80% horizon ~27 min. | Partial |
| Computer use | Operate any software like a human | OSWorld ~83%, inside the 72–84% human band. Still ~1 in 6 tasks fails. | Partial → parity |
| Knowledge-work quality | AI produces real professional output | GDPval-AA: Opus 4.8 1890 > GPT-5.5 1769. Graded on deliverable quality, output is expert-grade. | Real |
| End-to-end paid work | Replaces freelancers and employees | RLI ~2.5% of real paid projects finished autonomously. Owning a whole job is still out of reach. | Can't |
| Access to the frontier | (not a constraint in 2025) | Top tier (Mythos / Fable 5) can be revoked overnight by directive. Likely transient. | New gate |

Bottom line

The work is largely doable. What isn't yet here is autonomous, reliable, end-to-end ownership — an agent you hand a brief and trust to return finished work without a human in the loop. That gap, not raw capability, is where June 2026 actually sits.

Method & sources

Coding & computer use: SWE-bench Pro, OSWorld-Verified, MCP-Atlas and GDPval-AA figures are from published system cards plus independent reruns (Vellum, DataCamp, Vals AI). GLM-5.2's long-horizon numbers (FrontierSWE, SWE-Marathon) are vendor-reported by Z.ai; its second-place standing was corroborated independently by Code Arena's blind pairwise leaderboard.

Autonomy horizon: METR first-party measurements. Confidence intervals are wide (the ~12 h estimate spans roughly 5–65 h, more than 2× in each direction) and METR itself flags that the suite is saturating above ~16 h. Treat point estimates as directional, not precise.

End-to-end work: Remote Labor Index v1 (Scale AI + Center for AI Safety), 240 paid projects across 23 domains, scored by whether a reasonable client would accept the deliverable.

Access timeline: Anthropic's public statement and contemporaneous reporting (CNBC, QZ, WSJ).