Strange Loop

Humanity’s Last Exam: What Happens When AI Becomes Too Smart to Test?

The hardest AI test ever created is already being outpaced. As Deep Research smashes records, we must rethink how we measure intelligence—before AI renders our exams, and our oversight, obsolete.

Adam Cunningham
Feb 13, 2025

👋🏼, I’m Adam. Apparently, I’m supposed to reintroduce myself in every post because that’s what Substack says. So here we are.

I write about the intersection of technology and… everything, but almost always through the lens of culture. I cover whatever technology is currently chewing up and spitting out, where it’s all headed, what’s breaking (most of it), and what comes next (lots!). Beyond talking about the collapse of the influencer economy (hooray!), the rise of synthetic media (oh no), AI-driven discovery (wha?), and why the web is slowly turning into a hyper-personalised hallucination machine (ruh roh)… I drop various AI experiments for full subscribers to play with and try to break down the most complex topics of the day into actionable next steps.

Consider subscribing for the full edition—or don’t, but then you’ll probably just get the AI-generated summary.

Artificial intelligence is advancing far faster than most realise.

Everyone’s chilling in Paris at the AI Action Summit and, besides Vance proving that Americans still have difficulty with appropriately fitted suits, a lot’s been said… and the ✞ AI Trinity ✞ (OpenAI, Anthropic and Google DeepMind) are all on the same page. As Platformer put it:

  • We now know how to build artificial superintelligence, or ASI, and progress in building that intelligence is accelerating.

  • As a result, ASI will begin to arrive sometime between next year and 2030.

  • The arrival of ASI will bring with it extraordinary benefits, from curing human disease to solving climate change and eliminating global poverty.

  • But ASI will also bring extraordinary risks.

  • There are risks of what an individual person could do with superhuman intelligence: create novel chemical, biological, or nuclear weapons, for example, or engage in cyber warfare.

  • There are also risks of what a state empowered with ASI could do: enact total surveillance of its citizenry, take over the global economy, and dominate its enemies in warfare.

  • Finally, there is the risk that ASI will eventually ignore its creators and pursue its own objectives. It may come to ignore humans, or enslave them, or even eliminate them.

Now, maybe you don’t believe all or even any of that. But OpenAI CEO Sam Altman believes it, and so does Anthropic CEO Dario Amodei, and so does Google DeepMind CEO Demis Hassabis. And they spent several years telling anyone who would listen about it, including lawmakers and regulators, and urged them to craft regulations that would place guardrails around the development of AI before it was too late.

So.

Models are now acing exams once deemed impossible, forcing us to confront an existential question: How do we define and measure intelligence when our toughest tests become trivial for machines?

Today, we're talking about:

  • The creation of Humanity's Last Exam (HLE) and why it’s supposed to be AI’s ultimate challenge.

  • How OpenAI's Deep Research is obliterating expectations and setting new records.

  • Why traditional intelligence tests might soon be obsolete.

  • The implications of AI getting smarter than us (hint: it’s a bit unsettling).

  • What happens when AI is so good at everything that even making up harder tests becomes futile.

Explain It Like I’m 5:

Imagine you have the hardest puzzle ever. The kind that makes grown-ups cry. Now imagine a robot comes along, solves it in record time, and starts asking for a harder puzzle while eating a sandwich. That’s what’s happening with AI. We built the hardest test we could think of, and it’s already breezing through it. Now we’re stuck wondering: what’s next? And should we be worried?

Understanding Humanity’s Last Exam (HLE)

HLE is the academic equivalent of an unscalable mountain, designed by the Center for AI Safety (CAIS) and Scale AI to push AI to its limits. It consists of ~3,000 expert-level questions across 100+ disciplines, covering:

  • Mathematics (topology, number theory, statistics).

  • Natural sciences (quantum mechanics, biochemistry, astronomy).

  • Humanities (philosophy, linguistics, history).

  • Engineering and applied sciences (cryptography, electrical engineering, computer science).

The questions are sourced from nearly 1,000 domain experts across 500+ institutions, ensuring an unimpeachable gold standard of difficulty.
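
The public question set is distributed openly (a private held-out slice is kept back for contamination checks), so you can poke at it yourself. Here's a minimal sketch using Hugging Face's datasets library; the dataset ID cais/hle and the field names are my assumptions about the release, so check the official CAIS/Scale AI page before relying on them.

```python
# Minimal sketch: load and inspect Humanity's Last Exam.
# Assumptions (verify against the official release): the dataset is published
# on the Hugging Face Hub as "cais/hle" with a "test" split, and each row has
# "question", "answer", and "category" fields.
from collections import Counter

from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")
print(f"Total questions: {len(hle)}")  # roughly 3,000

# Rough breakdown across the 100+ disciplines.
subjects = Counter(row["category"] for row in hle)
for subject, count in subjects.most_common(10):
    print(f"{subject:<30} {count}")

# Peek at a single expert-written item.
print(hle[0]["question"][:500])
```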

via CAIS

Why HLE Was Created

Traditional AI benchmarks have been quickly outpaced by modern AI. GPT-4, for instance, breezed through undergraduate-level exams (MMLU) with 90%+ scores, making them ineffective for measuring true intelligence.

Microsoft puts GPT-4 ahead of Gemini Ultra again, using Google's own tricks
via The Decoder

HLE was designed as an academic Everest—questions that require actual deep reasoning rather than just memorisation. No AI has yet reached human-expert performance across the board, which is why it remains the gold standard for evaluating AI intelligence. For now.

HLE’s Role in AGI Safety

HLE isn’t just a test—it’s an early warning system. If AI starts mastering HLE, it signals that we might be inching towards Artificial General Intelligence (AGI), a scenario where AI is capable across all domains, not just one. When that happens, we need to be ready with governance, oversight, and serious discussions about control mechanisms.


Why Even Talk About This?

OpenAI's Deep Research and Rapid AI Progress

Deep Research is OpenAI’s latest AI model, designed to conduct complex multi-step reasoning autonomously. Unlike traditional large language models (LLMs), Deep Research:

  • Independently searches and analyses academic and technical materials.

  • Engages in iterative problem-solving to refine its answers.

  • Goes beyond static recall, demonstrating adaptive reasoning across disciplines.
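
OpenAI hasn't published Deep Research's internals, so what follows is only a rough sketch of the kind of search-read-refine loop described above, not the real thing. The helpers llm, search_web, and read_source are hypothetical placeholders, not OpenAI APIs.

```python
# Illustrative sketch of an iterative research loop; this is not OpenAI's
# actual Deep Research implementation. `llm`, `search_web`, and `read_source`
# are hypothetical stand-ins for a model call, a search API, and a page fetcher.
def deep_research(question, llm, search_web, read_source, max_rounds=5):
    notes = []
    query = question
    for _ in range(max_rounds):
        # 1. Search for sources relevant to the current query.
        for url in search_web(query)[:3]:
            # 2. Read each source and keep only the parts that matter.
            notes.append(llm(f"Summarise what this source says about "
                             f"{question!r}:\n{read_source(url)}"))
        # 3. Decide whether to stop, or what to look up next (iterative refinement).
        verdict = llm("Question: " + question +
                      "\nNotes so far:\n" + "\n".join(notes) +
                      "\nReply DONE if the notes answer the question; "
                      "otherwise reply with the next search query.")
        if verdict.strip() == "DONE":
            break
        query = verdict
    # 4. Synthesise a final, cited answer from the accumulated notes.
    return llm("Write a cited answer to: " + question +
               "\n\nNotes:\n" + "\n".join(notes))
```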

via OpenAI: Watch it in Action

Deep Research’s Record-Breaking Performance on HLE

Upon launch:

  • ChatGPT o3-mini (~13%) achieved roughly half the score of Deep Research.

  • DeepSeek (~12%), another high-performing model, was similarly eclipsed.

Deep Research recently achieved 26.6% accuracy on HLE, the highest score ever recorded. That is a 183% improvement over the previous best in just a few weeks, a staggering leap that has dramatically shifted expectations about AI’s capability trajectory.
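
For what it's worth, that 183% only makes sense relative to the previous state of the art, not to the ~13% scores above. Working backwards from the post's own numbers, it implies a prior best of roughly 9.4%: (26.6 − 9.4) / 9.4 ≈ 1.83, i.e. a 183% relative jump.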

r/singularity - All Benchmark Results of Deep Research

👆🏼👆🏼👆🏼👆🏼👆🏼 This is not nothing!

This exponential growth suggests that AI could surpass expert human performance on HLE within months rather than years. If this trend continues, static academic exams will no longer serve as an effective means of measuring AI intelligence.

Why Deep Research’s Performance Is a Game-Changer

Keep reading with a 7-day free trial

Subscribe to Strange Loop to keep reading this post and get 7 days of free access to the full post archives.
