Scientists are reportedly making rapid progress towards artificial intelligence models that reason more like human brains. A new AI model is said to be capable of advanced reasoning, unlike popular large language models (LLMs) such as ChatGPT, and its creators claim it performs better on key benchmarks.
Scientists at Singapore-based AI company Sapient have named the new reasoning AI a hierarchical reasoning model (HRM), and it is reportedly inspired by the hierarchical and multi-timescale processing in the human brain. This is essentially the way different areas of the brain integrate information over varying durations, which range from milliseconds to minutes.
According to the scientists, the new reasoning model has demonstrated better performance than existing LLMs and works more efficiently, reportedly because it needs far fewer parameters and training examples. The scientists claimed that the HRM has just 27 million parameters and was trained on around 1,000 samples. Parameters are the variables an AI model learns during training, such as weights and biases. In contrast, most advanced LLMs have billions or even trillions of parameters.
How does it perform?
When the HRM was tested on the ARC-AGI benchmark, known as one of the toughest tests of how close models are to attaining artificial general intelligence, the new model showed remarkable results, according to the study. The model scored 40.3 per cent on ARC-AGI-1, whereas OpenAI's o3-mini-high scored 34.5 per cent, Anthropic's Claude 3.7 scored 21.2 per cent, and DeepSeek R1 scored 15.8 per cent. Similarly, on the more difficult ARC-AGI-2 test, HRM scored 5 per cent, significantly surpassing the other models.
While most advanced LLMs use chain-of-thought (CoT) reasoning, scientists at Sapient argued that this method has some key shortcomings, such as ‘brittle task decomposition, extensive data requirements, and high latency.’ HRM, on the other hand, performs sequential reasoning in a single forward pass rather than step by step. It has two modules: a high-level module that performs slow, abstract planning and a low-level module that handles fast, detailed calculations. This is inspired by how different regions of the human brain handle planning versus quick reactions.
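The two-timescale idea described above can be sketched in code. This is a purely illustrative toy, not Sapient's implementation: the function name, the update rules, and the step counts are all invented here to show how a slow "planning" state can guide many fast "detail" updates inside a single forward pass.

```python
# Toy sketch of a two-timescale loop in the spirit of the described
# architecture. All names and update rules here are hypothetical.

def hrm_style_forward(x, high_steps=3, low_steps=5):
    """One forward pass: a slow high-level state guides fast low-level updates."""
    z_high = 0.0   # abstract "plan" state, updated slowly
    z_low = x      # detailed "work" state, updated quickly
    for _ in range(high_steps):
        for _ in range(low_steps):
            # Fast module: refine details under the current plan.
            z_low = 0.5 * z_low + 0.5 * z_high + 0.1
        # Slow module: update the plan from the low-level result.
        z_high = 0.9 * z_high + 0.1 * z_low
    return z_high, z_low
```

The point of the nesting is that the low-level loop runs several times for every single high-level update, mirroring the "milliseconds to minutes" timescale separation the researchers describe.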
Moreover, HRM employs a method known as iterative refinement, meaning it starts with a rough answer and improves it over numerous short bursts of thinking. After each burst, it reportedly checks whether it needs to keep refining or whether the result is good enough to serve as the final answer. According to the scientists, HRM solved Sudoku puzzles that ordinary LLMs typically fail at. The model also excelled at finding the best paths through mazes, demonstrating that it can handle structured, logical problems much better than LLMs.
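The refine-then-check pattern described above is a familiar one in numerical computing, and a classical example makes it concrete. The sketch below uses Newton's method for square roots as a stand-in: each "burst" improves a rough guess, and a halting check decides whether the answer is good enough. This is an analogy only; in the paper the halting decision is learned, and none of these names come from Sapient's code.

```python
# Hypothetical illustration of iterative refinement with a halting check:
# start from a rough guess and keep running short refinement bursts until
# the answer stops improving. (Newton's method for sqrt(target) is used
# here only as a stand-in for a learned refinement step.)

def refine(guess, target=2.0, tol=1e-9, max_bursts=50):
    """Refine an estimate of sqrt(target), halting when improvement stalls."""
    for burst in range(1, max_bursts + 1):
        new_guess = 0.5 * (guess + target / guess)  # one refinement burst
        if abs(new_guess - guess) < tol:            # "good enough" check
            return new_guess, burst
        guess = new_guess
    return guess, max_bursts
```

Starting from a crude guess like 1.0, the loop converges to the square root of 2 within a handful of bursts, stopping on its own once further refinement no longer changes the answer.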
While the results are remarkable, it should be noted that the paper, published in the arXiv database, has yet to be peer-reviewed. However, the ARC-AGI benchmark team attempted to recreate the results after the model was made open source. The team did confirm the numbers, but also found that the hierarchical architecture did not improve performance as much as claimed. A less-documented refinement process during training was likely the reason for the strong numbers, they found.
© IE Online Media Services Pvt Ltd