New Benchmark Scores AI Models Near Zero

March 31, 2026

Three days after Jensen Huang declared AGI achieved, ARC-AGI-3 scored every frontier model under one percent on novel interactive tasks humans solved perfectly.

Note: This article examines a contested debate with speculative elements. Events described reflect positions taken by named individuals; interpretations of AGI status remain actively disputed.

Summary

On March 23, 2026, Nvidia CEO Jensen Huang told Lex Fridman that artificial general intelligence has been achieved. On March 26 — three days later — the ARC Prize Foundation released ARC-AGI-3: 135 novel interactive environments that no AI model had seen during training. Humans solved them at 100% efficiency. The best AI model tested scored 0.37%. Grok-4.20 scored exactly zero. The debate is not about capability. It is about what "general" means.

The Scores

System	ARC-AGI-3 Score (RHAE)
Humans	100%
Google Gemini 3.1 Pro	0.37%
OpenAI GPT-5.4	0.26%
Anthropic Claude Opus 4.6	0.25%
xAI Grok-4.20	0.00%

ARC Prize for matching human performance: $2,000,000

What Jensen Huang Said

On March 23, Huang made the most definitive public statement of his career on the topic:

"I think it's now. I think we've achieved AGI." — Jensen Huang, Nvidia CEO, Lex Fridman Podcast, March 23, 2026

Huang's definition of AGI is operational: an AI capable of executing sophisticated multi-step workflows, writing production-grade code, and, in principle, taking a technology company to a $1 billion valuation without a human overseeing each step. By that standard, he argues, Claude Code, GPT-5.4 with tool use, and multi-agent Grok configurations already qualify.

The statement landed in CNBC, Forbes, Fortune, and Yahoo Finance within hours. The response from the research community was skeptical.

Three Days Later: ARC-AGI-3

François Chollet — the creator of the original ARC-AGI benchmark and co-founder of the ARC Prize Foundation — published ARC-AGI-3 on March 26. The timing relative to Huang's declaration was not coincidental.

ARC-AGI-3 is built to test exactly what Huang's definition ignores: genuine generalization. The benchmark presents AI with 135 interactive environments that could not have appeared in any training data: novel problem spaces that require exploration and reasoning from scratch, with no instructions. The scoring metric, Relative Human Action Efficiency (RHAE), rewards not only solving an environment but solving it in as few actions as a human would. The penalty is steep: a model that needs ten times the actions a human would need earns only 1% credit for that environment, not the 10% a linear rule would give.
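
The article gives one data point for RHAE (ten times the human action count yields 1% credit) but not the formula itself. One rule consistent with that data point is the square of the human-to-AI action ratio. The sketch below assumes that rule, along with a cap at 100% credit and a plain average across environments; the function names and sample numbers are hypothetical and are not taken from the ARC Prize Foundation.

```python
def rhae_credit(human_actions: int, ai_actions: int, solved: bool = True) -> float:
    """Per-environment credit under an ASSUMED squared-ratio rule.

    Assumption: credit = (human_actions / ai_actions) ** 2, capped at 1.0.
    An unsolved environment earns nothing.
    """
    if not solved or ai_actions <= 0:
        return 0.0
    ratio = min(1.0, human_actions / ai_actions)  # no bonus for beating the human
    return ratio ** 2


def benchmark_score(results: list[tuple[int, int, bool]]) -> float:
    """Average credit over environments (averaging is also an assumption).

    Each result is (human_actions, ai_actions, solved).
    """
    if not results:
        raise ValueError("at least one environment result is required")
    credits = [rhae_credit(h, a, s) for h, a, s in results]
    return sum(credits) / len(credits)


if __name__ == "__main__":
    # The article's example: ten times the human action count.
    print(f"{rhae_credit(20, 200):.2%}")  # prints 1.00%
    # Hypothetical run: one inefficient solve, one human-level solve, one failure.
    print(f"{benchmark_score([(20, 200, True), (20, 20, True), (20, 50, False)]):.2%}")
```

Under this assumed rule, merely finishing an environment is worth very little unless the path to the solution is close to human length, which is one way scores can sit well below one percent even when a model occasionally solves something.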

To prevent gaming, 110 of the 135 environments are withheld from public access. Only 25 are open for testing. No model has come anywhere near a score that would claim the $2 million prize.

Why the Scores Are So Low

The performance gap is not surprising to researchers who study AI generalization. Today's frontier models are extraordinarily capable on tasks that resemble their training distribution. They can write sophisticated code, synthesize complex documents, and solve math problems at or above PhD level — because they have seen millions of examples of such tasks.

ARC-AGI-3 removes that advantage entirely. The environments are designed to be unlike anything in any dataset. There are no instructions. There is no prior training data that maps to the structure of each puzzle. Performance requires the kind of flexible, exploratory reasoning that humans develop naturally and that current AI architectures do not have.

Grok's zero score is especially revealing. Grok-4.20 benchmarks well on standard tests that measure memorized knowledge and pattern matching. On ARC-AGI-3, it earned no credit in any environment, which suggests it could not make even productive exploratory moves, let alone generalize beyond its training.

Two Definitions, One Unresolved Argument

The Huang-Chollet disagreement is structural, not factual. They are measuring different things.

	Jensen Huang	François Chollet
Definition of AGI	AI that executes complex workflows and creates commercial value at scale	AI that generalizes to novel situations without prior training, as any human does
Current AI status	Already achieved	Not achieved — best score 0.37%
Benchmark framing	Practical output matters	Generalization capability is the only valid test
Financial interest	Nvidia's valuation depends on AI maturity narrative	Independent researcher; prize not yet claimed by anyone

"If a system cannot generalize to novel situations without instruction, it is expensive autocomplete — not general intelligence." — François Chollet, ARC Prize Foundation, March 2026

Yahoo Finance and Fortune both noted in their coverage that Huang's declaration comes from the CEO of a company that sells much of the hardware powering AI development, a material conflict of interest that shapes how his claims should be weighted.

Where Other AI Leaders Stand

Person	Organization	Position on AGI (March 2026)
Jensen Huang	Nvidia	Achieved — AI can run complex workflows commercially
François Chollet	ARC Prize Foundation	Not achieved — 0.37% on novel environment benchmark
Demis Hassabis	Google DeepMind	Approaching in narrow scientific domains
Dario Amodei	Anthropic	Within reach by 2026–2027 in specific knowledge domains
Yann LeCun	AMI Labs / Meta	Far from achieved — missing physical world models and common sense

What This Means Practically

For people using AI tools today, the debate is somewhat academic. Current models are genuinely powerful for the tasks they were trained on: writing, coding, research synthesis, analysis, reasoning within familiar problem structures.

What they cannot reliably do is encounter a genuinely new kind of problem, one with no training data analogue, and figure out how to approach it from scratch. That gap is not a marketing footnote. It is a 99.63-percentage-point gap between the best AI score, Gemini 3.1 Pro's 0.37%, and the human baseline on a benchmark specifically designed to measure it.

The $2 million ARC Prize has not been claimed. The benchmark is open. The gap remains.
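
The gap figures follow directly from the published score table. As a quick check, the snippet below recomputes each model's distance from the 100% human baseline; the scores are copied from the table, while the variable and function names are illustrative only.

```python
# Scores in percent, copied from the article's table.
scores = {
    "Google Gemini 3.1 Pro": 0.37,
    "OpenAI GPT-5.4": 0.26,
    "Anthropic Claude Opus 4.6": 0.25,
    "xAI Grok-4.20": 0.00,
}
HUMAN_BASELINE = 100.0


def gap_to_human(score_pct: float) -> float:
    """Percentage-point distance from the human baseline, rounded to 2 places."""
    return round(HUMAN_BASELINE - score_pct, 2)


gaps = {name: gap_to_human(score) for name, score in scores.items()}
best_model = max(scores, key=scores.get)  # highest-scoring AI system

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points below the human baseline")
print(f"Best AI score: {best_model} at {scores[best_model]:.2f}%")
```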

Frequently Asked Questions

Did Nvidia's Jensen Huang declare AGI achieved? Yes. On March 23, 2026, Huang said on the Lex Fridman podcast: "I think it's now. I think we've achieved AGI." His definition requires an AI that can autonomously execute complex multi-step tasks and create commercial value; it does not require generalization to novel situations, as the academic definition does.

What did ARC-AGI-3 measure and what were the scores? ARC-AGI-3, released March 26, 2026 by the ARC Prize Foundation, tests AI on 135 novel interactive environments with no training data overlap. The scoring metric (RHAE) also penalizes inefficiency. Humans scored 100%. Gemini 3.1 Pro scored 0.37% (highest AI score). GPT-5.4 scored 0.26%, Claude Opus 4.6 scored 0.25%, and Grok-4.20 scored 0%.

Why do AI models score so poorly on ARC-AGI-3? The benchmark removes all training advantages. Models cannot pattern-match to prior examples because none exist. ARC-AGI-3 requires true generalization — reasoning from scratch about novel environments — which current AI architectures cannot reliably do. Grok-4.20's zero score shows that memorized knowledge, while useful on standard benchmarks, provides zero benefit when confronting genuinely unseen problem types.

What is the ARC Prize and has anyone won it? The ARC Prize Foundation is offering $2 million for any AI system matching human performance on ARC-AGI-3. As of the end of March 2026, no model has come close. The benchmark withholds 110 of 135 environments from public access to prevent training on test data.

Sources

Fortune — "Nvidia's Jensen Huang says 'We've achieved AGI.' But no one can agree on what that means"
Decrypt — "Is AGI Here? Not Even Close, New AI Benchmark Suggests"
Forbes — "Nvidia's Jensen Huang Says He Thinks 'We've Achieved AGI'"
Winbuzzer — "ARC-AGI-3 Offers $2M for AI Matching Human Reasoning"
ARC Prize Foundation — ARC-AGI-3 benchmark release, March 26, 2026

Published on March 31, 2026