Leading AI models fail new test of artificial general intelligence
A new test of AI capabilities consists of puzzles that humans can solve without too much trouble, but that all leading AI models struggle with. To pass it, AI companies will need to balance problem-solving ability with cost.
By Chris Stokel-Walker
25 March 2025
The ARC-AGI-2 benchmark is designed to be a difficult test for AI models
Just_Super/Getty Images
The most sophisticated AI models in existence today have scored poorly on a new benchmark designed to measure their progress towards artificial general intelligence (AGI) – and brute-force computing power won’t be enough to improve their scores, because evaluators now take into account the cost of running the models.
There are many competing definitions of AGI, but it is generally taken to refer to an AI that can perform any cognitive task that humans can do. To measure this, the ARC Prize Foundation previously launched a test of reasoning abilities called ARC-AGI-1. Last December, OpenAI announced that its o3 model had scored highly on the test, leading some to ask whether the company was close to achieving AGI.
But now a new test, ARC-AGI-2, has raised the bar. It is difficult enough that no current AI system on the market can achieve more than a single-digit score out of 100 on the test, while every question has been solved by at least two humans in fewer than two attempts.
In a blog post announcing ARC-AGI-2, ARC Prize Foundation president Greg Kamradt said the new benchmark was needed to test different skills from those covered by the previous iteration. “To beat it, you must demonstrate both a high level of adaptability and high efficiency,” he wrote.
The ARC-AGI-2 benchmark differs from other AI benchmarks in that it focuses on models’ ability to complete simple tasks – such as replicating changes in a new image based on past examples of symbolic interpretation – rather than on their ability to match the performance of world-leading PhD-level experts. Current models are good at the kind of “deep learning” that ARC-AGI-1 measured, but they fare worse on the seemingly simpler ARC-AGI-2 tasks, which demand more flexible reasoning and interaction. OpenAI’s o3-low model, for instance, scores 75.7 per cent on ARC-AGI-1, but just 4 per cent on ARC-AGI-2.
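For readers curious about what these puzzles look like in practice, the public ARC datasets represent each task as a handful of “train” input/output grid pairs plus a held-out “test” pair, with grids encoded as small arrays of integers standing in for colours. The Python sketch below is a toy illustration of that format: the colour-swap task and the deliberately naive solver are invented for this example, not taken from ARC-AGI-2, and real benchmark tasks require far richer inference than a cell-by-cell substitution.

    # Toy ARC-style task: grids are lists of lists of integers 0-9
    # (colours). A solver must infer the transformation from a few
    # "train" input/output pairs and apply it to the "test" input.
    # This made-up task simply recolours 1 -> 2.
    toy_task = {
        "train": [
            {"input": [[1, 0], [0, 1]], "output": [[2, 0], [0, 2]]},
            {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
        ],
        "test": [
            {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
        ],
    }

    def infer_colour_map(pairs):
        """Infer a cell-wise colour substitution from the train pairs."""
        mapping = {}
        for pair in pairs:
            for row_in, row_out in zip(pair["input"], pair["output"]):
                for a, b in zip(row_in, row_out):
                    if mapping.setdefault(a, b) != b:
                        return None  # transformation is not a pure colour swap
        return mapping

    def apply_colour_map(grid, mapping):
        """Apply the inferred substitution to every cell of a grid."""
        return [[mapping.get(c, c) for c in row] for row in grid]

    mapping = infer_colour_map(toy_task["train"])
    prediction = apply_colour_map(toy_task["test"][0]["input"], mapping)
    print(prediction == toy_task["test"][0]["output"])  # True for this toy task

The point of the sketch is the format, not the method: because each task shows only a few examples of a novel rule, hand-coded heuristics like this one fail on almost all real tasks, which is precisely the gap between humans and current models that the benchmark is designed to expose.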