AI Achieves Human-Level Intelligence on New Benchmark Test
Scientists have announced that a new artificial intelligence (AI) model has reached human-level results on a test aimed at measuring what is known as "general intelligence".
On December 20, OpenAI's o3 system scored 85% on the ARC-AGI benchmark, well above the previous best of 55% achieved by any AI model and on par with the average human score. It also performed well on a very difficult mathematics test.
The ambitious goal of creating artificial general intelligence (AGI) is a common objective among leading AI research labs. At a glance, OpenAI appears to have taken a notable step toward realizing this vision.
Despite some lingering skepticism, many AI researchers and developers sense that something has just shifted. For them, AGI now seems more tangible, more pressing and nearer than most had anticipated. But are they right?
Understanding General Intelligence
To grasp the implications of the o3 model's performance, it is critical to understand the nature of the ARC-AGI benchmark. Essentially, this test evaluates how efficiently an AI system learns and adapts to new situations, which is termed "sample efficiency." In simpler terms, it looks at how many examples a system needs to successfully grasp and address a new challenge.
By this measure, a model like ChatGPT (GPT-4) is not very sample efficient. It was trained on millions of examples of human text, deriving statistical rules about which word combinations are most likely.
The result is competence at common tasks but weakness at rare ones, for which it has seen far less data.
The true potential of AI systems will only be realized when they can learn from fewer examples and display greater sample efficiency. Until then, their applications will largely be restricted to repetitive tasks or situations where occasional inaccuracies can be tolerated.
The ability to solve previously unseen problems from only a small amount of data is known as the capacity to generalize, widely regarded as a core component of intelligence.
Learning Through Patterns
The ARC-AGI benchmark tests for sample-efficient adaptation using simple grid problems. For each question, the AI system is shown three worked examples and must discern the underlying rule that generalizes across them, then apply it to a fourth case.
These problems resemble the abstract-reasoning puzzles found in IQ tests.
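To make the format concrete, here is a toy ARC-style task sketched in Python. The grids, the hidden rule ("mirror each row") and the helper name are all invented for illustration; real ARC-AGI tasks use colored grids of varying sizes and far subtler rules.

```python
# A toy ARC-style task: each grid is a small matrix of integers (colors).
# The hidden rule in this invented example is "mirror the grid left-to-right".
train_pairs = [
    ([[1, 0, 0],
      [2, 0, 0]],
     [[0, 0, 1],
      [0, 0, 2]]),
    ([[0, 3],
      [4, 0]],
     [[3, 0],
      [0, 4]]),
    ([[5, 0, 0, 0]],
     [[0, 0, 0, 5]]),
]

test_input = [[0, 7, 0],
              [8, 0, 0]]

def mirror(grid):
    """The rule a solver must infer from the examples: reverse each row."""
    return [list(reversed(row)) for row in grid]

# Check the candidate rule against all training examples before trusting it,
# then apply it to the held-out fourth grid.
assert all(mirror(x) == y for x, y in train_pairs)
print(mirror(test_input))  # [[0, 7, 0], [0, 0, 8]]
```

A solver gets only the three input/output pairs; inferring `mirror` from them, rather than being told it, is the whole point of the test.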
Finding Adaptable Rules
While we do not know exactly how OpenAI achieved this, the results suggest the o3 model is remarkably adaptable: from just a few examples, it finds rules that carry over to new contexts.
The key is to avoid making unnecessary assumptions, or being any more specific than the examples demand. In theory, identifying the simplest or "weakest" rules that still achieve the desired outcome maximizes adaptability to new situations.
In plain terms, a weak rule is one that can be stated simply, for example: "Any shape with a protruding line moves to the end of that line and covers up any other shapes it overlaps." Because such a rule assumes so little, it transfers readily to situations the system has never seen.
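As a rough illustration of this preference, the sketch below (an invented toy, not OpenAI's method) enumerates a hand-written set of candidate rules, each tagged with a complexity score, and keeps the simplest one consistent with every example:

```python
# Toy "weakest rule" search: among candidate transformations consistent
# with the worked examples, keep the lowest-complexity one. The candidate
# set and its complexity scores are invented for illustration.
CANDIDATES = {
    "identity":      (0, lambda g: [row[:] for row in g]),
    "mirror":        (1, lambda g: [row[::-1] for row in g]),
    "flip_vertical": (1, lambda g: g[::-1]),
    "rotate_180":    (2, lambda g: [row[::-1] for row in g[::-1]]),
}

examples = [
    ([[1, 0], [0, 2]], [[0, 1], [2, 0]]),
    ([[3, 0, 0]],      [[0, 0, 3]]),
]

def weakest_fit(pairs):
    """Return the lowest-complexity rule that fits every example pair."""
    fits = [(cost, name, fn) for name, (cost, fn) in CANDIDATES.items()
            if all(fn(x) == y for x, y in pairs)]
    return min(fits)  # lowest complexity wins; raises if nothing fits

cost, name, rule = weakest_fit(examples)
print(name, rule([[0, 9], [9, 0]]))  # mirror [[9, 0], [0, 9]]
```

A real system would search a vastly larger rule space, but the principle, preferring the weakest rule that works, is the same.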
Exploring Thought Processes
Although we still lack detailed insight into how OpenAI built the o3 model, it seems unlikely the system was deliberately optimized to find weak rules. However, to succeed at the ARC-AGI tasks, it must be finding them somehow.
What is known is that OpenAI initiated the o3 model as a general-purpose AI and subsequently specialized it for the ARC-AGI test.
Francois Chollet, the French AI researcher who created the benchmark, believes o3 searches through different "chains of thought" describing steps to solve the task. It then selects the "best" according to some loosely defined rule, or heuristic, much as Google's AlphaGo searched through possible sequences of moves in a game of Go.
You can think of these chains of thought as candidate solutions fitted to the examples. If the process really is AlphaGo-like, the system needs a heuristic to decide which candidate is best.
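If Chollet's guess is right, the loop might loosely resemble the following sketch. Everything here is hypothetical: `generate_candidate` and `heuristic_score` are stand-ins for unknown model internals, and the random scoring exists purely to show the shape of the search.

```python
import random

def generate_candidate(task):
    """Stand-in for the model proposing one step-by-step solution chain."""
    return [random.choice(["mirror", "recolor", "move", "grow"])
            for _ in range(random.randint(1, 4))]

def heuristic_score(task, chain):
    """Stand-in for the loosely defined rule that judges how well a
    chain fits the task's worked examples (random, purely for shape)."""
    return random.random()

def solve(task, samples=1000):
    """Sample many candidate chains and keep the one the heuristic
    scores highest, loosely analogous to AlphaGo evaluating move
    sequences during its search."""
    return max((generate_candidate(task) for _ in range(samples)),
               key=lambda chain: heuristic_score(task, chain))

random.seed(0)
print(solve(task={"examples": []}))
```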
What Remains Unclear
The pressing question now is whether o3 truly brings us closer to AGI. If o3 works in this AlphaGo-like way, the underlying model may not be much better than its predecessors. The concepts it learns from language may be no more suitable for generalization than before; what we may be seeing instead is a more generalizable "chain of thought" found through the extra step of training a heuristic specialized for this test.
Currently, much about o3 remains unknown. OpenAI's disclosures have been confined to a few media presentations and early testing by a handful of researchers and AI-safety institutions.
To fully grasp the implications of o3, comprehensive evaluations and analyses of its capabilities will be required, including its success and failure rates.
Once o3 is publicly launched, we are likely to have a clearer picture of whether it possesses adaptability similar to that of an average human.
If it does, we could witness transformative changes across many sectors and the beginning of a new era of self-improving intelligence. That will demand new benchmarks for AGI itself, and serious thought about how such systems should be governed.
However, should the o3 model fall short, it will still represent a remarkable achievement, albeit with little change to daily life as we know it.