
New Benchmark Scores AI Models Near Zero
New benchmark scores AI models near zero just days after Nvidia CEO declared AGI achieved
Note: This article examines a contested debate with speculative elements. Events described reflect positions taken by named individuals; interpretations of AGI status remain actively disputed.
Summary
On March 23, 2026, Nvidia CEO Jensen Huang told Lex Fridman that artificial general intelligence has been achieved. On March 26 — three days later — the ARC Prize Foundation released ARC-AGI-3: 135 novel interactive environments that no AI model had seen during training. Humans solved them at 100% efficiency. The best AI model tested scored 0.37%. Grok-4.20 scored exactly zero. The debate is not about capability. It is about what "general" means.
The Scores
| System | ARC-AGI-3 Score (RHAE) |
|---|---|
| Humans | 100% |
| Google Gemini 3.1 Pro | 0.37% |
| OpenAI GPT-5.4 | 0.26% |
| Anthropic Claude Opus 4.6 | 0.25% |
| xAI Grok-4.20 | 0.00% |

ARC Prize for a passing score: $2,000,000

What Jensen Huang Said
On March 23, Huang made the most definitive public statement of his career on the topic:
"I think it's now. I think we've achieved AGI." — Jensen Huang, Nvidia CEO, Lex Fridman Podcast, March 23, 2026
Huang's definition of AGI is operational: an AI capable of executing sophisticated multi-step workflows, writing production-grade code, and — in principle — running a technology company to a $1 billion valuation without requiring a human to oversee each step. By that standard, he argues, Claude Code, GPT-5.4 with tool use, and multi-agent Grok configurations already qualify.
The statement landed in CNBC, Forbes, Fortune, and Yahoo Finance within hours. The response from the research community was skeptical.
Three Days Later: ARC-AGI-3
François Chollet — the creator of the original ARC-AGI benchmark and co-founder of the ARC Prize Foundation — published ARC-AGI-3 on March 26. The timing relative to Huang's declaration was not coincidental.
ARC-AGI-3 is built to test exactly what Huang's definition ignores: genuine generalization. The benchmark presents AI with 135 interactive environments that could not have appeared in any training data — novel problem spaces that require exploration and reasoning from scratch, with no instructions. The scoring metric, Relative Human Action Efficiency (RHAE), also penalizes inefficiency: solve the puzzle with ten times the actions a human would need, and you earn only 1% credit for that environment.
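The efficiency penalty described above can be sketched in a few lines of Python. The article gives only one data point (ten times the human action count earns 1% credit), which is consistent with squaring the ratio of human actions to AI actions. The squared form, the cap at full credit, the unweighted average across environments, and the function names are all assumptions made for illustration; this is not the ARC Prize Foundation's published scoring code.

```python
def environment_credit(human_actions: int, ai_actions: int, solved: bool = True) -> float:
    """Hypothetical per-environment credit under an ASSUMED squared-efficiency rule."""
    if not solved or ai_actions <= 0:
        return 0.0  # assumption: an unsolved environment earns no credit
    # Assumption: credit is capped at 1.0, so beating the human count earns no bonus.
    ratio = min(1.0, human_actions / ai_actions)
    return ratio ** 2


def benchmark_score(results) -> float:
    """Assumed aggregate: unweighted mean of per-environment credits, as a percentage.

    `results` is a list of (human_actions, ai_actions, solved) tuples.
    """
    credits = [environment_credit(h, a, s) for h, a, s in results]
    return 100.0 * sum(credits) / len(credits)


# The article's example: a human needs 20 actions, the model needs 200.
print(f"{environment_credit(20, 200):.2%}")  # 1.00%
```

Under this assumed rule, a model that solves every environment but needs ten times as many actions as a human would still score only about 1% overall, so near-zero totals need not mean that no environment was ever solved.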
To prevent gaming, 110 of the 135 environments are withheld from public access. Only 25 are open for testing. No model has come anywhere near a score that would claim the $2 million prize.
Why the Scores Are So Low
The performance gap is not surprising to researchers who study AI generalization. Today's frontier models are extraordinarily capable on tasks that resemble their training distribution. They can write sophisticated code, synthesize complex documents, and solve math problems at or above PhD level — because they have seen millions of examples of such tasks.
ARC-AGI-3 removes that advantage entirely. The environments are designed to be unlike anything in any dataset. There are no instructions. There is no prior training data that maps to the structure of each puzzle. Performance requires the kind of flexible, exploratory reasoning that humans develop naturally and that current AI architectures have not demonstrated.
Grok's zero score is especially revealing. Grok-4.20 benchmarks well on standard tests that measure memorized knowledge and pattern matching. On ARC-AGI-3, it scored zero on every novel environment, which suggests it could not generalize beyond its training well enough even to make productive exploratory moves.
Two Definitions, One Unresolved Argument
The Huang-Chollet disagreement is structural, not factual. They are measuring different things.
| | Jensen Huang | François Chollet |
|---|---|---|
| Definition of AGI | AI that executes complex workflows and creates commercial value at scale | AI that generalizes to novel situations without prior training, as any human does |
| Current AI status | Already achieved | Not achieved — best score 0.37% |
| Benchmark framing | Practical output matters | Generalization capability is the only valid test |
| Financial interest | Nvidia's valuation depends on AI maturity narrative | Independent researcher; prize not yet claimed by anyone |
"If a system cannot generalize to novel situations without instruction, it is expensive autocomplete — not general intelligence." — François Chollet, ARC Prize Foundation, March 2026
Yahoo Finance and Fortune both noted in their coverage that Huang's declaration comes from the CEO of the company that sells the hardware powering most AI development, a material conflict of interest that shapes how his claims should be weighted.
Where Other AI Leaders Stand
| Person | Organization | Position on AGI (March 2026) |
|---|---|---|
| Jensen Huang | Nvidia | Achieved — AI can run complex workflows commercially |
| François Chollet | ARC Prize Foundation | Not achieved — 0.37% on novel environment benchmark |
| Demis Hassabis | Google DeepMind | Approaching in narrow scientific domains |
| Dario Amodei | Anthropic | Within reach by 2026–2027 in specific knowledge domains |
| Yann LeCun | AMI Labs / Meta | Far from achieved — missing physical world models and common sense |
What This Means Practically
For people using AI tools today, the debate is somewhat academic. Current models are genuinely powerful for the tasks they were trained on: writing, coding, research synthesis, analysis, reasoning within familiar problem structures.
What they cannot reliably do is encounter a genuinely new kind of problem, one with no training data analogue, and figure out how to approach it from scratch. That gap is not a marketing footnote. It is a 99.63-percentage-point gap between Gemini 3.1 Pro's score, the best of any AI model tested, and the human baseline on a benchmark specifically designed to measure it.
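The 99.63-point figure is plain subtraction from the human baseline, and the same arithmetic applies to every model in the scores table. A minimal sketch, using only the scores reported in this article (the variable and function names are illustrative):

```python
HUMAN_BASELINE = 100.0  # human RHAE score reported in the article, in percent

# AI scores as reported in the article's table, in percent.
ai_scores = {
    "Google Gemini 3.1 Pro": 0.37,
    "OpenAI GPT-5.4": 0.26,
    "Anthropic Claude Opus 4.6": 0.25,
    "xAI Grok-4.20": 0.00,
}


def gap_to_human(score: float) -> float:
    """Percentage-point gap to the human baseline, rounded to two decimals."""
    return round(HUMAN_BASELINE - score, 2)


gaps = {model: gap_to_human(score) for model, score in ai_scores.items()}
best_model = max(ai_scores, key=ai_scores.get)

for model, gap in gaps.items():
    print(f"{model}: {gap:.2f} points below the human baseline")
```

The spread between the best and worst AI scores (0.37 points) is tiny next to the distance from any of them to the human baseline.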
The $2 million ARC Prize has not been claimed. The benchmark is open. The gap remains.
Frequently Asked Questions
Did Nvidia's Jensen Huang declare AGI achieved? Yes. On March 23, 2026, Huang said on the Lex Fridman podcast: "I think it's now. I think we've achieved AGI." His definition requires an AI that can autonomously execute complex multi-step tasks and create commercial value — not the academic definition requiring generalization to novel situations.
What did ARC-AGI-3 measure and what were the scores? ARC-AGI-3, released March 26, 2026 by the ARC Prize Foundation, tests AI on 135 novel interactive environments with no training data overlap. The scoring metric (RHAE) also penalizes inefficiency. Humans scored 100%. Gemini 3.1 Pro scored 0.37% (highest AI score). GPT-5.4 scored 0.26%, Claude Opus 4.6 scored 0.25%, and Grok-4.20 scored 0%.
Why do AI models score so poorly on ARC-AGI-3? The benchmark removes all training advantages. Models cannot pattern-match to prior examples because none exist. ARC-AGI-3 requires true generalization, meaning reasoning from scratch about novel environments, which current AI architectures cannot reliably do. Grok-4.20's zero score suggests that memorized knowledge, while useful on standard benchmarks, provides little or no benefit when confronting genuinely unseen problem types.
What is the ARC Prize and has anyone won it? The ARC Prize Foundation is offering $2 million for any AI system matching human performance on ARC-AGI-3. As of the end of March 2026, no model has come close. The benchmark withholds 110 of 135 environments from public access to prevent training on test data.
Sources
- Fortune — "Nvidia's Jensen Huang says 'We've achieved AGI.' But no one can agree on what that means"
- Decrypt — "Is AGI Here? Not Even Close, New AI Benchmark Suggests"
- Forbes — "Nvidia's Jensen Huang Says He Thinks 'We've Achieved AGI'"
- Winbuzzer — "ARC-AGI-3 Offers $2M for AI Matching Human Reasoning"
- ARC Prize Foundation — ARC-AGI-3 benchmark release, March 26, 2026

