At the Manus moment in early 2025, the open question was whether an agent could do the task at all. Eighteen months on, it mostly can — the coding ceiling rose a full generation, computer use reached the human band, and a single model can now run for half a day at coinflip odds. The live questions have moved downstream: reliability, and whether an agent can own a job end to end.
69.2% · SWE-bench Pro: real GitHub issues, no answer leakage
83.4% · OSWorld computer use: inside the human range
~12 h · Autonomy horizon, but only at 50% reliability
2.5% · Real paid work done end-to-end (Remote Labor Index)
The coding ceiling rose a generation
Higher is better · SWE-bench Pro — resolving real, post-cutoff GitHub issues across a full repo
Claude Opus 4.8: 69.2%
GLM-5.2 (open weights): 62.1%
GPT-5.5: 58.6%
Kimi K2.6: 58.6%
Read the harder number. SWE-bench Verified is saturating: Opus 4.8 scores ~88.6% and the field clusters near 88–95%, so it no longer separates models. Pro draws from actively maintained repos created after training cutoff, where multi-file diffs require real architectural understanding. The gap there is the honest signal, and open weights (GLM-5.2, 62.1%) now trail the closed frontier (Opus 4.8, 69.2%) by about 7 points. One asterisk on all of it: the splashier long-horizon scores tend to be self-reported by the same lab that shipped the model (the eternal "trust me bro" benchmarks), so they are worth exactly as much as your trust in the source until someone independent reruns them. The Pro figures above already survived that; treat the rest as marketing with a methodology section.
Computer use crossed into the human band
Higher is better · OSWorld-Verified — driving a real desktop with mouse and keyboard
Claude Opus 4.8: 83.4%
GPT-5.5: 78.7%
Gemini 3.1 Pro: 76.2%
Human baseline range — 72–84%, depending on task category. Top models now land inside it.
Six months ago the leaders were GPT-5.4 at 75.0% and Opus 4.6 at 72.7% — barely at the floor of the human band. The frontier moved ~8 points in two model generations. Caveat: ~1 in 6 desktop tasks still fails, and scores are harness-dependent.
The autonomy split is the whole story
METR task-completion time horizon: the longest task, measured in human working time, that a model completes at a given success rate. Note the log scale.
80% reliability (dependable): ~27 min
50% reliability (coinflip): ~12 h
>16 h: beyond the ruler
[Chart axis, log scale: task length from 1 min to 24 h]
Same model, two answers
The 50% horizon (Opus 4.6) reached ~12 hours; Claude Mythos Preview ran past METR's 16-hour measurement ceiling. But the 80% horizon — where you'd actually leave it unattended — is still ~27 minutes. Hours at a coinflip; half an hour you can trust.
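One way to read the two horizons together is as two points on a single success curve. The sketch below assumes a logistic curve in the logarithm of task length, pinned to the ~12 h and ~27 min figures quoted above. The curve shape, the function name `success_probability`, and the sample task lengths are illustrative assumptions, not METR's published fit.

```python
import math

def success_probability(task_minutes, h50_minutes, h80_minutes):
    """Estimated success odds for a task of the given length.

    Assumption for this sketch only: a logistic curve in log2(task length)
    that passes through 50% at h50 and 80% at h80.
    """
    # Slope chosen so that p(h80) = 0.8, since logit(0.8) = ln(4).
    slope = math.log(4) / math.log2(h50_minutes / h80_minutes)
    x = slope * (math.log2(h50_minutes) - math.log2(task_minutes))
    return 1.0 / (1.0 + math.exp(-x))

H50 = 12 * 60  # ~12 h, from the text
H80 = 27       # ~27 min, from the text

# Sample task lengths are hypothetical, chosen to span the chart's axis.
for minutes in (10, 27, 60, 240, 720):
    p = success_probability(minutes, H50, H80)
    print(f"{minutes:>4} min task: {p:.0%} estimated success")
```

By construction the curve returns 80% at 27 minutes and 50% at 12 hours. Everything in between is only as good as the assumed shape, and the wide confidence intervals on the underlying estimates apply here too.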
And accelerating
The post-2023 doubling time is now ~4 months, down from the original ~7. The catch: confidence intervals are enormous (the ~12 h estimate spans roughly 5–65 h) because the benchmark is running out of tasks long enough to measure the frontier.
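To make the difference between a ~4-month and a ~7-month doubling time concrete, here is a minimal sketch that extrapolates the ~12 h point estimate under steady exponential growth. The steady-growth assumption, the 12-month window, and the function name `projected_horizon_hours` are illustrative choices, not a forecast from METR or from this article.

```python
def projected_horizon_hours(current_hours, months_ahead, doubling_months):
    """Horizon after `months_ahead` months, assuming steady exponential growth."""
    return current_hours * 2 ** (months_ahead / doubling_months)

CURRENT_H50_HOURS = 12  # ~12 h point estimate, from the text

# Compare the two doubling times quoted in the text over a hypothetical 12 months.
for doubling in (4, 7):
    after_year = projected_horizon_hours(CURRENT_H50_HOURS, 12, doubling)
    print(f"doubling every {doubling} months: ~{after_year:.0f} h after 12 months")
```

At a 4-month doubling time, twelve months is three doublings, so 12 h becomes 96 h. Because the starting estimate itself spans roughly 5–65 h, any such projection inherits that uncertainty in full.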
End to end, on real paid work: still the floor
Remote Labor Index (Scale AI / CAIS) — can an agent be the freelancer and finish the whole brief, not assist with parts of it?
2.5% of 240 real, end-to-end freelance projects were completed to a standard a reasonable client would accept, and that is the single best agent. Every model tested came in under 3%. The benchmarks above measure isolated skills; this measures owning the deliverable.
$143,991 · earned by the human freelancers who did this work
$1,720 · value the top agent (Manus) completed acceptably
Manus: 2.5%
Grok-4 / Sonnet 4.5: 2.1%
GPT-5: 1.7%
ChatGPT agent: 1.3%
Gemini 2.5 Pro: 0.8%
Bars are on the full 0–100% scale — human = 100% (an accepted deliverable). The slivers are the point. v1 figures (Oct 2025); the leaderboard is live, but the floor still stands through mid-2026. Models are creeping up on pairwise Elo, not yet on completion rate.
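The panel's headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted above; the variable names are illustrative, and the project count is approximate because the 2.5% figure is itself rounded.

```python
PROJECTS = 240            # RLI v1 project count, from the text
TOP_COMPLETION = 0.025    # Manus completion rate, from the text
HUMAN_EARNINGS = 143_991  # dollars earned by the human freelancers
AGENT_VALUE = 1_720       # dollars of work the top agent completed acceptably

# Approximate count of accepted projects implied by the rounded 2.5% rate.
accepted = round(PROJECTS * TOP_COMPLETION)

# Share of the total dollar value that the top agent delivered.
value_share = AGENT_VALUE / HUMAN_EARNINGS

print(f"accepted projects: about {accepted} of {PROJECTS}")
print(f"share of dollar value: {value_share:.1%}")
```

That works out to about 6 accepted projects and roughly 1.2% of the dollar value. The dollar share is lower than the 2.5% completion rate, which suggests the projects the agent did finish were cheaper than the average brief.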
A new wrinkle: access to the frontier
For the first time, the most capable tier wasn't gated by capability — it was switched off by directive.
Jun 9, 2026 · Fable 5 & Mythos 5 launch. Anthropic's most capable public models. Mythos Preview had already run past METR's 16 h ceiling.
Jun 12, 2026 · US Commerce export directive. All access suspended for any foreign national, worldwide, citing a narrow jailbreak. First export control aimed at a model, not chips.
Jun 12, 5:21pm ET · Disabled globally within hours. Anthropic couldn't verify nationality in real time, so it pulled both for everyone. It disputes the order; litigation ongoing.
Public lifespan: 3 days.
Noted for the record, not the trend. The ban is a bummer and the precedent matters, but the capability it gated won't stay scarce. Open weights are already about 7 points behind the closed frontier on SWE-bench Pro (GLM-5.2 at 62.1% vs Opus 4.8 at 69.2%), and other labs ship monthly. The interesting variable for planning isn't this one model's availability; it's what autonomous work is possible at all, which is what every panel above is really measuring.
Claimed in early 2025 → possible now
The Manus-era pitch, scored against June 2026 reality.
| Domain | Claimed at the Manus moment | Possible now (Jun 2026) | Verdict |
| --- | --- | --- | --- |
| Autonomous coding | AI engineers ship features end-to-end | Scoped, multi-file work with review. SWE-Pro ~69%, not the saturated ~88–95% Verified figure. Still supervised on novel work. | Partial ↑ |
| Long-running autonomy | Agents run for hours, unattended | 50% horizon ~12 h (Mythos past the 16 h ceiling), but 80% horizon ~27 min. | Partial |
| Computer use | Operate any software like a human | OSWorld ~83%, inside the 72–84% human band. Still ~1 in 6 tasks fails. | Partial → parity |
| Knowledge-work quality | AI produces real professional output | GDPval-AA: Opus 4.8 1890 > GPT-5.5 1769. Graded on deliverable quality, output is expert-grade. | Real |
| End-to-end paid work | Replaces freelancers and employees | RLI ~2.5% of real paid projects finished autonomously. Owning a whole job is still out of reach. | Can't |
| Access to the frontier | (not a constraint in 2025) | Top tier (Mythos / Fable 5) can be revoked overnight by directive. Likely transient. | New gate |
Bottom line
The work is largely doable. What isn't yet here is autonomous, reliable, end-to-end ownership — an agent you hand a brief and trust to return finished work without a human in the loop. That gap, not raw capability, is where June 2026 actually sits.
Method & sources
Coding & computer use: SWE-bench Pro, OSWorld-Verified, MCP-Atlas and GDPval-AA figures are from published system cards plus independent reruns (Vellum, DataCamp, Vals AI). GLM-5.2's long-horizon numbers (FrontierSWE, SWE-Marathon) are vendor-reported by Z.ai; its second-place standing was corroborated independently by Code Arena's blind pairwise leaderboard.
Autonomy horizon: METR first-party measurements. Confidence intervals are wide and asymmetric (the ~12 h estimate spans roughly 5–65 h), and METR itself flags that the suite is saturating above ~16 h. Treat point estimates as directional, not precise.
End-to-end work: Remote Labor Index v1 (Scale AI + Center for AI Safety), 240 paid projects across 23 domains, scored by whether a reasonable client would accept the deliverable.
Access timeline: Anthropic's public statement and contemporaneous reporting (CNBC, QZ, WSJ).